Capstone Project: Create a Customer Segmentation Report for Arvato Financial Services

In this project, you will analyze demographics data for customers of a mail-order sales company in Germany, comparing it against demographics information for the general population. You'll use unsupervised learning techniques to perform customer segmentation, identifying the parts of the population that best describe the core customer base of the company. Then, you'll apply what you've learned on a third dataset with demographics information for targets of a marketing campaign for the company, and use a model to predict which individuals are most likely to convert into becoming customers for the company. The data that you will use has been provided by our partners at Bertelsmann Arvato Analytics, and represents a real-life data science task.

If you completed the first term of this program, you will be familiar with the first part of this project, from the unsupervised learning project. The versions of those two datasets used in this project will include many more features and has not been pre-cleaned. You are also free to choose whatever approach you'd like to analyzing the data rather than follow pre-determined steps. In your work on this project, make sure that you carefully document your steps and decisions, since your main deliverable for this project will be a blog post reporting your findings.

Business Understanding

First things first, what is a mail-order sales company?

According to britannica, a mail-order business is a

method of merchandising in which the seller’s offer is made through mass mailing of a circular or catalog or through an advertisement placed in a newspaper or magazine and in which the buyer places an order by mail.

What are we supposed do in this project?

  1. We need to compare the population of customers with the general population
  2. We need find out which targeted individuals in the new mailout campaign are more likely to convert

How can we compare the populations of customers with the general population?

  1. First we shall explore the data to get to know the features within it
  2. Then we shall segment the general population
  3. Finally we shall figure out which segments do our customers belong to

How can we find out which targeted individuals in the new campaign are likely to convert?

  1. By predicting the segment of these targeted individuals
  2. We can find out if they belong to our customers segments or other segments in the population
In [1]:
# update joblib
!pip install --ignore-installed joblib==0.14
Collecting joblib==0.14
  Using cached https://files.pythonhosted.org/packages/8f/42/155696f85f344c066e17af287359c9786b436b1bf86029bb3411283274f3/joblib-0.14.0-py2.py3-none-any.whl
Installing collected packages: joblib
Successfully installed joblib-0.14.0
In [2]:
# update sklearn
!pip install --upgrade scikit-learn
Requirement already up-to-date: scikit-learn in /opt/conda/lib/python3.6/site-packages (0.24.2)
Requirement already satisfied, skipping upgrade: scipy>=0.19.1 in /opt/conda/lib/python3.6/site-packages (from scikit-learn) (1.2.1)
Requirement already satisfied, skipping upgrade: threadpoolctl>=2.0.0 in /opt/conda/lib/python3.6/site-packages (from scikit-learn) (2.2.0)
Requirement already satisfied, skipping upgrade: numpy>=1.13.3 in /opt/conda/lib/python3.6/site-packages (from scikit-learn) (1.19.5)
Requirement already satisfied, skipping upgrade: joblib>=0.11 in /opt/conda/lib/python3.6/site-packages (from scikit-learn) (0.14.0)
In [3]:
# install imblearn
!pip install imblearn
Requirement already satisfied: imblearn in /opt/conda/lib/python3.6/site-packages (0.0)
Requirement already satisfied: imbalanced-learn in /opt/conda/lib/python3.6/site-packages (from imblearn) (0.8.1)
Requirement already satisfied: joblib>=0.11 in /opt/conda/lib/python3.6/site-packages (from imbalanced-learn->imblearn) (0.14.0)
Requirement already satisfied: numpy>=1.13.3 in /opt/conda/lib/python3.6/site-packages (from imbalanced-learn->imblearn) (1.19.5)
Requirement already satisfied: scipy>=0.19.1 in /opt/conda/lib/python3.6/site-packages (from imbalanced-learn->imblearn) (1.2.1)
Requirement already satisfied: scikit-learn>=0.24 in /opt/conda/lib/python3.6/site-packages (from imbalanced-learn->imblearn) (0.24.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in /opt/conda/lib/python3.6/site-packages (from scikit-learn>=0.24->imbalanced-learn->imblearn) (2.2.0)
In [4]:
# import libraries here; add more as necessary
import pickle
import warnings
import gc
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from collections import defaultdict

from datetime import datetime

from imblearn.ensemble import BalancedRandomForestClassifier, BalancedBaggingClassifier
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE

from sklearn.base import clone
from sklearn.cluster import MiniBatchKMeans
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA, IncrementalPCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, SelectPercentile, RFE, VarianceThreshold, chi2
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, recall_score, precision_score, confusion_matrix, classification_report, roc_auc_score
from sklearn.model_selection import StratifiedKFold, RandomizedSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

from tqdm import tqdm

# stop warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# magic word for producing visualizations in notebook
%matplotlib inline

read_pickle = False
seed = 42
In [5]:
pd.set_option('max_columns', 120)
pd.set_option('max_colwidth', 5000)
plt.rcParams['figure.dpi'] = 120

Functions

In [6]:
def reduce_mem_usage(props):
    start_mem_usg = props.memory_usage().sum() / 1024**2 
    print("Memory usage of properties dataframe is :",start_mem_usg," MB")
    NAlist = [] # Keeps track of columns that have missing values filled in. 
    for col in props.columns:
        if props[col].dtype != object:  # Exclude strings
            
            # Print current column type
            print("******************************")
            print("Column: ",col)
            print("dtype before: ",props[col].dtype)
            
            # make variables for Int, max and min
            IsInt = False
            mx = props[col].max()
            mn = props[col].min()
            
            # Integer does not support NA, therefore, NA needs to be filled
            if not np.isfinite(props[col]).all():
                print("Filled NA with", mx+1)
                NAlist.append(col)
                props[col].fillna(mx+1, inplace=True)  
                   
            # test if column can be converted to an integer
            asint = props[col].fillna(0).astype(np.int64)
            result = (props[col] - asint)
            result = result.sum()
            if result > -0.01 and result < 0.01:
                IsInt = True

            
            # Make Integer/unsigned Integer datatypes
            if IsInt:
                # Refill with NA
                if col in NAlist:
                    print(f"Replaced {mx+1} with NA")
                    props[col].replace(mx+1, np.nan, inplace=True)
                    props.loc[:, col] = props[col].astype(np.float16)
                    
                elif mn >= 0:
                    if mx < 255:
                        props.loc[:, col] = props[col].astype(np.uint8)
                    elif mx < 65535:
                        props.loc[:, col] = props[col].astype(np.uint16)
                    elif mx < 4294967295:
                        props.loc[:, col] = props[col].astype(np.uint32)
                    else:
                        props.loc[:, col] = props[col].astype(np.uint64)
                        
                else:
                    if mn > np.iinfo(np.int8).min and mx < np.iinfo(np.int8).max:
                        props.loc[:, col] = props[col].astype(np.int8)
                    elif mn > np.iinfo(np.int16).min and mx < np.iinfo(np.int16).max:
                        props.loc[:, col] = props[col].astype(np.int16)
                    elif mn > np.iinfo(np.int32).min and mx < np.iinfo(np.int32).max:
                        props.loc[:, col] = props[col].astype(np.int32)
                    elif mn > np.iinfo(np.int64).min and mx < np.iinfo(np.int64).max:
                        props.loc[:, col] = props[col].astype(np.int64)
            
            # Make float datatypes 16 bit
            else:
                props.loc[:, col] = props[col].astype(np.float16)
            
            # Print new column type
            print("dtype after: ",props[col].dtype)
            print("******************************")
    
    # Print final result
    print("___MEMORY USAGE AFTER COMPLETION:___")
    mem_usg = props.memory_usage().sum() / 1024**2 
    print("Memory usage is: ",mem_usg," MB")
    print("This is ",100*mem_usg/start_mem_usg,"% of the initial size")
    return props, NAlist


def get_null_prop(df, axis=0, plot=True):
    """Calculates null proportions in dataframe columns or rows."""
    # calculate null proportion of each column or row
    null_prop = (df.isnull().sum(axis=axis) / df.shape[axis]).sort_values()
    if plot:
        null_prop.hist(edgecolor='black', linewidth=1.2)    
    return null_prop


def replace_unknown_with_null(df):
    """
    Replace unknown values encoded as 0 and -1 into null values.
    """
    # Read Values sheets which has information about unknown encodings
    dias_vals = pd.read_excel("DIAS Attributes - Values 2017.xlsx")
    
    # Find unknown values for features
    feat_unknown_vals = dias_vals.query("Meaning == 'unknown'")
    
    # Replace unknown values for each feature in the dataframe
    for feat in feat_unknown_vals.itertuples():
        # Check if feature in dataframe featuers
        if feat.Attribute in df.columns:
            # if unknown values are more than one
            if ',' in str(feat.Value):
                # loop over unknown values
                for val in str(feat.Value).split(','):
                    # replace unknown value with null
                    df[feat.Attribute].replace(eval(val), np.nan, inplace=True)
            else:
                # replace unknown value with null
                df[feat.Attribute].replace(feat.Value, np.nan, inplace=True) 
    
    # Replace other unknown values that aren't in Values sheet
    df["ALTER_HH"].replace(0, np.nan, inplace=True)
    df["KOMBIALTER"].replace(0, np.nan, inplace=True)
    
    # Replace Non-binary features with 0 and -1 that are of unknown category to null
    unknown_feats = df[list(set(df.columns).difference(dias_vals.Attribute))].copy()
    for feat in unknown_feats:
        if df[feat].nunique() > 2:
            df[feat].replace(-1, np.nan, inplace=True)
            df[feat].replace(0, np.nan, inplace=True)
    
    return df


def compare_features(dfs=[], labels=[]):
    """Plots all features of passes dataframes for comparison"""
    # parameters of subplot
    nrows = 1
    ncols = 2

    # number of features
    nfeats = feat_cat_df.shape[0]
    
    # colors 
    colors = "bgrcmy"
    

    # loop over 4 features at a time to plot them in a row
    for i in range(0, nfeats, ncols):

        # make subplots
        fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(16, 3));

        # loop over each of 4 features
        for j, feat_idx in enumerate(range(i, i+ncols)):
            # feature name and category
            feat = feat_cat_df.iloc[feat_idx].feature
            cat = feat_cat_df.iloc[feat_idx].category
            
            # ldict with label as key and df feature as value
            dfs_feat = {}
            for df, label in zip(dfs, labels):
                dfs_feat[label] = df.loc[:, feat]

            # plot using histogram if unique values exceeds 11
            if dfs_feat[label].nunique() > 16:
                for (label, df_feat), color in zip(dfs_feat.items(), colors):
                    df_feat.hist(bins=20, color=color, density=True, alpha=0.4, ax=axes[j], label=label)
                axes[j].legend()

            # plot using bar chart
            else:
                columns = []
                feats_p = None
                # concatenate all df features values counts in one dataframe and plot bar plot
                for label, df_feat in dfs_feat.items():
                    columns.append(label)
                    feat_p = df_feat.value_counts()/df_feat.shape[0]
                    feats_p = pd.concat([feats_p, feat_p], axis=1)
                feats_p.columns = columns
                feats_p.sort_index().plot(kind="bar", ax=axes[j])

            # assign unknown feature description 
            feat_desc = 'Unknown'

            # if feature description exists include instead of unknown
            if feat in known_feats:
                feat_desc = dias_atts[dias_atts["Attribute"] == feat]["Description"].item()

            # set title as feature, category and description
            axes[j].set_title(f"{feat}\n{cat}\n{feat_desc}")
            
            
def clean_dataset(df, p_row=0.5, p_col=0.5, drop_uncertain=True, keep_features=[]):
    """
    Clean dataset using insights gained during EDA.
    
    inputs
    
    1. df (pandas dataframe)
    2. p_thresh (float)  -  maximum threshold of null values in columns
    3. drop_uncertain (bool)  -  drop features we are uncertain from
    """
    # Make a new copy of the dataframe
    clean_df = df.copy()
    
    # Clean columns with mixed dtypes
    clean_df["CAMEO_DEUG_2015"].replace('X', np.nan, inplace=True)
    clean_df["CAMEO_INTL_2015"].replace('XX', np.nan, inplace=True)
    
    # Replace unknown values with missing
    clean_df = replace_unknown_with_null(clean_df)   

    # Drop rows with more than 50% missing values
    min_count = int(((1 - p_row))*clean_df.shape[1] + 1)
    clean_df.dropna(axis=0, thresh=min_count, inplace=True)
    
    # Drop duplicated rows
    clean_df.drop_duplicates(inplace=True)
       
    # Drop GEBURTSJAHR (year of birth) that has 44% missing values 
    clean_df.drop('GEBURTSJAHR', axis=1, inplace=True)
      
    # Drop LNR which is a unique indentifier
    clean_df.drop('LNR', axis=1, inplace=True)
    
    # Drop CAMEO_DEU_2015 as it's not suitable for PCA
    clean_df.drop('CAMEO_DEU_2015', axis=1, inplace=True)
    
    # Drop features with more than p_thresh missing values
    features_missing_p = clean_df.isna().sum() / clean_df.shape[0]
    features_above_p_thresh = clean_df.columns[features_missing_p > p_col]
    
    features_to_keep = ["ALTER_KIND1", "ALTER_KIND2", "ALTER_KIND3", "ALTER_KIND4", "ANZ_KINDER"] + keep_features
    features_to_remove = [feat for feat in features_above_p_thresh if feat not in features_to_keep]
    
    clean_df.drop(features_to_remove, axis=1, inplace=True)
    
    # Drop uncertain features
    if drop_uncertain:
        uncertain_features = ["GEMEINDETYP"]
        clean_df.drop(uncertain_features, axis=1, inplace=True)
        
    # Feature Engineering
    # One Hot Encoding D19_LETZTER_KAUF_BRANCHE
    dummies = pd.get_dummies(clean_df["D19_LETZTER_KAUF_BRANCHE"], prefix="D19_LETZTER_KAUF_BRANCHE")
    clean_df = pd.concat([clean_df, dummies], axis=1)
    clean_df.drop("D19_LETZTER_KAUF_BRANCHE", axis=1, inplace=True)
    
    # Calculate year difference in MIN_GEBAEUDEJAHR
    clean_df["MIN_GEBAEUDEJAHR_ENG"] = (2017 - clean_df["MIN_GEBAEUDEJAHR"])
    clean_df.drop("MIN_GEBAEUDEJAHR", axis=1, inplace=True)
    
    # Calculate days difference in EINGEFUEGT_AM
    current = datetime.strptime("2017-01-01", "%Y-%m-%d")
    clean_df["EINGEFUEGT_AM_DAY"] = (current - pd.to_datetime(clean_df["EINGEFUEGT_AM"])).dt.days
    clean_df.drop("EINGEFUEGT_AM", axis=1, inplace=True)
    
    # Replace null values in ALTER_KIND and ANZ_KINDER with 0 to avoid imputation
    for feat in clean_df.columns[clean_df.columns.str.startswith("ALTER_KIND")]:
        clean_df[feat].replace(np.nan, 0, inplace=True)
    clean_df["ANZ_KINDER"].replace(np.nan, 0, inplace=True)
    
    # Convert OST_WEST_KZ to binary labels
    clean_df["OST_WEST_KZ"] = (clean_df["OST_WEST_KZ"] == "W").astype(np.uint8)
    
    # Convert CAMEO_INTL_2015 and CAMEO_DEUG_2015 to float32
    CAMEO_feats = ["CAMEO_INTL_2015", "CAMEO_DEUG_2015"]
    clean_df[CAMEO_feats] = clean_df[CAMEO_feats].astype(np.float32)

    # Convert float16 features to float32 to enable arithmetic operations
    float_feats = clean_df.select_dtypes(np.float16).columns
    clean_df[float_feats] = clean_df[float_feats].astype(np.float32)
    
    return clean_df    


def batch_fit_scaler(scaler, data, n_batches=100):
    """
    Fit a scaler to data through n batches.
    
    Input:
    transfromer (scikit-learn Scaler object)
    data (numpy array or pandas dataframe)
    n_batches (int)  -  number of batches for fitting the scaler
    
    Output:
    Fitted Scaler
    """
    for X_batch in tqdm(np.array_split(data, n_batches), total=n_batches):
        scaler.partial_fit(X_batch)
        
    return scaler


def batch_fit_pca(pca, scaler, data, n_batches=100):
    """
    Fit an Incremental PCA to scaled data through n batches.
    
    Input:
    pca (scikit-learn IncrementalPCA object)
    scaler (scikit-learn Scaler object)
    data (numpy array or pandas dataframe)
    n_batches (int)  -  number of batches for fitting the transformer
    
    Output:
    Fitted IncrementalPCA
    """
    for X_batch in tqdm(np.array_split(data, n_batches), total=n_batches):
        scaled_X_batch = scaler.transform(X_batch)
        pca.partial_fit(scaled_X_batch)
    
    return pca
        

def batch_transform_pca(pca, scaler, data, n_batches=100):
    """
    Transform large data using fitted pca.
    
    Input:
    pca (Fitted IncrementalPCA)
    scaler (Fitted Scaler)
    data (numpy array or pandas dataframe)
    n_batches (int)  -  number of batches for fitting the transformer
    
    Output:
    Transformed data
    """
    pca_data = None
    for X_batch in tqdm(np.array_split(data, n_batches), total=n_batches):
        scaled_X_batch = scaler.transform(X_batch)
        pca_X_batch = pca.transform(scaled_X_batch)
        if pca_data is None:
            pca_data = pca_X_batch
        else: 
            pca_data = np.vstack([pca_data, pca_X_batch])
    return pca_data
        

def plot_multi_bar(feat, dfs_feat, ax, color):
    columns = []
    feats_p = None
    # concatenate all df features values counts in one dataframe and plot bar plot
    for label, df_feat in dfs_feat.items():
        columns.append(label)
        feat_p = df_feat.value_counts()/df_feat.shape[0]
        feats_p = pd.concat([feats_p, feat_p], axis=1)
    feats_p.columns = columns
    
    # map value ticks to their meanings
    vals_dict = get_feat_val_meaning(feat)
    feats_p.sort_index(inplace=True)
    feats_p.index = feats_p.index.map(vals_dict)
    
    feats_p.plot(kind="bar", color=color, ax=ax); 
    
    
def get_feat_val_meaning(feat):
    # filter out featue
    feat_vals = dias_vals[dias_vals["Attribute"] == feat]

    # make a series of Value and Meaning with Value as index
    feat_vals_meaning = feat_vals[["Value", "Meaning"]].set_index("Value")

    # convert series to dict 
    vals_dict = feat_vals_meaning.to_dict()["Meaning"]   
    
    return vals_dict
    
    
def compare_feature(feat, title):
    fig, axes = plt.subplots(1, 2, figsize=(16, 4))

    # plot feature for customers an non-customers
    dfs_feat = {"Customers": customers[feat],
                "Non-Customers": non_customers[feat]}

    plot_multi_bar(feat, dfs_feat, axes[0], ["tab:blue", "grey"])
    
    # set titles of the subplots
    axes[0].set_title(f"{title} between Customers and Non-Customers")
    axes[1].set_title(f"{title} between Customer Clusters")



    # plot feature for customers' clusters
    dfs_feat = {"Cluster 0": cluster_0[feat],
                "Cluster 4": cluster_4[feat],
                "Cluster 6": cluster_6[feat]}

    plot_multi_bar(feat, dfs_feat, axes[1], ["#878cad", "#4b527c", "#323752"])
    
    
def model_validation(model, X, y, fold_results=False, final_results=True, return_scores=False, return_preds=False, plot_confusion=False, metric=roc_auc_score):
    """Evaluate model using sklearn's classification report."""
    # Instantiate StratifiedKFold 
    skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=seed)

    # Make empty y_pred to fill predictions of each fold
    y_pred = np.zeros(y.shape)
    
    # Initialize empty list for scores
    scores = []

    for i, (train_idx, test_idx) in enumerate(skf.split(X, y)):
        X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
        y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
        
        # Make copy of model
        model_clone = clone(model)

        # Fit model
        model_clone.fit(X_train, y_train)

        # Predict fold y_test and add in y_pred
        fold_pred = model_clone.predict(X_test)
        y_pred[test_idx] += fold_pred

        # Print classification report
        if fold_results:
            print("Fold:", i+1)
            print(classification_report(y_test, fold_pred))
        
        # Calculate metric scores (default is roc_auc_score)
        scores.append(metric(y_test, fold_pred, average="macro"))

    # Print final classification report
    if final_results:
        print("Final Report:")
        print(classification_report(y, y_pred))
        
        print("Metric Score:", np.mean(scores))
    
    # Return metric scores if passed to the function
    if return_scores:
        return scores
    
    # Plot confusion matrix if specified
    if plot_confusion:
        ax = plt.subplot()
        cmat = confusion_matrix(y, y_pred)
        sns.heatmap(cmat, annot=True, fmt="g", ax=ax)
        
        ax.set_xlabel('Predicted labels');ax.set_ylabel('True labels'); 
        ax.set_title('Confusion Matrix'); 
        ax.xaxis.set_ticklabels(['No Response', 'Response']); ax.yaxis.set_ticklabels(['No Response', 'Response']);
        

def evaluate_models(models_dict, X, y, scoring="f1_macro"):
    """Evaluate several models using sklearn's cross_val_score."""
    model_scores = {}
    for name, model in models_dict.items():
        skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=seed)
        scores = cross_val_score(model, X, y, cv=skf, scoring=scoring)
        model_scores[name] = scores
        print("Model:%s, Score:%.3f (+/- %.3f)" % (name, np.mean(scores), np.std(scores)))
    return model_scores

Part 0: Get to Know the Data

There are four data files associated with this project:

  • Udacity_AZDIAS_052018.csv: Demographics data for the general population of Germany; 891 211 persons (rows) x 366 features (columns).
  • Udacity_CUSTOMERS_052018.csv: Demographics data for customers of a mail-order company; 191 652 persons (rows) x 369 features (columns).
  • Udacity_MAILOUT_052018_TRAIN.csv: Demographics data for individuals who were targets of a marketing campaign; 42 982 persons (rows) x 367 (columns).
  • Udacity_MAILOUT_052018_TEST.csv: Demographics data for individuals who were targets of a marketing campaign; 42 833 persons (rows) x 366 (columns).

Each row of the demographics files represents a single person, but also includes information outside of individuals, including information about their household, building, and neighborhood. Use the information from the first two files to figure out how customers ("CUSTOMERS") are similar to or differ from the general population at large ("AZDIAS"), then use your analysis to make predictions on the other two files ("MAILOUT"), predicting which recipients are most likely to become a customer for the mail-order company.

The "CUSTOMERS" file contains three extra columns ('CUSTOMER_GROUP', 'ONLINE_PURCHASE', and 'PRODUCT_GROUP'), which provide broad information about the customers depicted in the file. The original "MAILOUT" file included one additional column, "RESPONSE", which indicated whether or not each recipient became a customer of the company. For the "TRAIN" subset, this column has been retained, but in the "TEST" subset it has been removed; it is against that withheld column that your final predictions will be assessed in the Kaggle competition.

Otherwise, all of the remaining columns are the same between the three data files. For more information about the columns depicted in the files, you can refer to two Excel spreadsheets provided in the workspace. One of them is a top-level list of attributes and descriptions, organized by informational category. The other is a detailed mapping of data values for each feature in alphabetical order.

In the below cell, we've provided some initial code to load in the first two datasets. Note for all of the .csv data files in this project that they're semicolon (;) delimited, so an additional argument in the read_csv() call has been included to read in the data properly. Also, considering the size of the datasets, it may take some time for them to load completely.

You'll notice when the data is loaded in that a warning message will immediately pop up. Before you really start digging into the modeling and analysis, you're going to need to perform some cleaning. Take some time to browse the structure of the data and look over the informational spreadsheets to understand the data values. Make some decisions on which features to keep, which features to drop, and if any revisions need to be made on data formats. It'll be a good idea to create a function with pre-processing steps, since you'll need to clean all of the datasets before you work with them.

In [7]:
# csv data path
data_path = '../../data/Term2/capstone/arvato_data/'

# csv files
azdias_csv = 'Udacity_AZDIAS_052018.csv'
customers_csv = 'Udacity_CUSTOMERS_052018.csv'

# csv file paths
azdias_csv_path = os.path.join(data_path, azdias_csv)
customers_csv_path = os.path.join(data_path, customers_csv)

# pickle files
azdias_pickle = 'Udacity_AZDIAS_052018.csv.pandas.pickle'
customers_pickle = 'Udacity_CUSTOMERS_052018.csv.pandas.pickle'
In [8]:
# load pickled datasets if exists
if azdias_pickle in os.listdir() and customers_pickle in os.listdir():
    print("Loading AZDIAS pickle...")
    azdias = pd.read_pickle(azdias_pickle)
    
    print("Loading CUSTOMERS pickle...")
    customers = pd.read_pickle(customers_pickle)
    
    read_pickle = True

# else load csv and save pickles
else:
    print("Loading AZDIAS csv...")
    azdias = pd.read_csv(azdias_csv_path, sep=';')

    print("Loading CUSTOMERS csv...")
    customers = pd.read_csv(customers_csv_path, sep=';')
    
print("Loading Attributes Sheet...")
dias_atts = pd.read_excel("DIAS Information Levels - Attributes 2017.xlsx")
    
print("Loading Values Sheet...")
dias_vals = pd.read_excel("DIAS Attributes - Values 2017.xlsx")
    
print("Done.")
Loading AZDIAS pickle...
Loading CUSTOMERS pickle...
Loading Attributes Sheet...
Loading Values Sheet...
Done.

First, we need to pay attention to the dtype warning. It says that columns 18 and 19 have mixed types.

In [9]:
azdias.iloc[:, [18, 19]].head()
Out[9]:
CAMEO_DEUG_2015 CAMEO_INTL_2015
0 NaN NaN
1 8 51
2 4 24
3 2 12
4 6 43
In [10]:
customers.iloc[:, [18, 19]].head()
Out[10]:
CAMEO_DEUG_2015 CAMEO_INTL_2015
0 1 13
1 NaN NaN
2 5 34
3 4 24
4 7 41
In [11]:
azdias['CAMEO_DEUG_2015'].value_counts()
Out[11]:
8      78023
9      62578
6      61253
4      60185
8.0    56418
3      50360
2      48276
9.0    45599
7      45021
6.0    44621
4.0    43727
3.0    36419
2.0    34955
7.0    32912
5      32292
5.0    23018
1      20997
1.0    15215
X        373
Name: CAMEO_DEUG_2015, dtype: int64
In [12]:
customers['CAMEO_DEUG_2015'].value_counts()
Out[12]:
2      17574
4      16458
6      14008
3      13585
1      12498
8       9716
5       8624
7       7878
2.0     5910
4.0     5606
3.0     4805
9       4731
6.0     4709
1.0     4280
8.0     3333
5.0     3042
7.0     2680
9.0     1661
X        126
Name: CAMEO_DEUG_2015, dtype: int64
In [13]:
azdias['CAMEO_INTL_2015'].value_counts()
Out[13]:
51      77576
51.0    56118
41      53459
24      52882
41.0    38877
24.0    38276
14      36524
43      32730
14.0    26360
54      26207
43.0    23942
25      22837
54.0    19184
22      19173
25.0    16791
23      15653
13      15272
45      15206
22.0    13982
55      13842
52      11836
23.0    11097
13.0    11064
31      11041
45.0    10926
34      10737
55.0    10113
15       9832
52.0     8706
44       8543
31.0     7983
34.0     7787
12       7645
15.0     7142
44.0     6277
35       6090
32       6067
33       5833
12.0     5604
32.0     4287
35.0     4266
33.0     4102
XX        373
Name: CAMEO_INTL_2015, dtype: int64
In [14]:
customers['CAMEO_INTL_2015'].value_counts()
Out[14]:
14      14708
24      13301
41       8461
43       7158
25       6900
15       6845
51       5987
13       5728
22       5566
14.0     4939
24.0     4504
23       4276
34       3945
45       3936
54       3537
41.0     2859
55       2794
12       2791
43.0     2476
25.0     2472
15.0     2372
44       2144
51.0     2126
31       2050
13.0     1955
22.0     1941
35       1741
23.0     1494
34.0     1423
45.0     1352
54.0     1258
32       1256
33       1178
12.0      924
55.0      920
52        770
44.0      688
31.0      681
35.0      553
32.0      440
33.0      396
52.0      253
XX        126
Name: CAMEO_INTL_2015, dtype: int64

There are 373 rows in AZDIAS that have X in CAMEO_DEUG_2015 and XX in CAMEO_INTL_2015, while there are 126 rows in customers that have XX in the same two features.

What do these features represent?

According to TransunionUK, CAMEO is a consumer segmentation system linking address information to demographic, lifestyle and socio-economic insight. So it's basically data obtained from a different information source, and that's why rows that have missing value in one feature have it also missing in the other.

We can fill these rows with NaNs for now and continue with exploring the dataset.

In [15]:
feats = ["CAMEO_INTL_2015", "CAMEO_DEUG_2015"]

for feat in feats:
    azdias[feat].replace({"X": np.nan}, inplace=True)
    customers[feat].replace({"XX": np.nan}, inplace=True)

What does the datasets look like?

In [16]:
# looking into the general population dataset
print('Shape:', azdias.shape)
azdias.head()
Shape: (891221, 366)
Out[16]:
LNR AGER_TYP AKT_DAT_KL ALTER_HH ALTER_KIND1 ALTER_KIND2 ALTER_KIND3 ALTER_KIND4 ALTERSKATEGORIE_FEIN ANZ_HAUSHALTE_AKTIV ANZ_HH_TITEL ANZ_KINDER ANZ_PERSONEN ANZ_STATISTISCHE_HAUSHALTE ANZ_TITEL ARBEIT BALLRAUM CAMEO_DEU_2015 CAMEO_DEUG_2015 CAMEO_INTL_2015 CJT_GESAMTTYP CJT_KATALOGNUTZER CJT_TYP_1 CJT_TYP_2 CJT_TYP_3 CJT_TYP_4 CJT_TYP_5 CJT_TYP_6 D19_BANKEN_ANZ_12 D19_BANKEN_ANZ_24 D19_BANKEN_DATUM D19_BANKEN_DIREKT D19_BANKEN_GROSS D19_BANKEN_LOKAL D19_BANKEN_OFFLINE_DATUM D19_BANKEN_ONLINE_DATUM D19_BANKEN_ONLINE_QUOTE_12 D19_BANKEN_REST D19_BEKLEIDUNG_GEH D19_BEKLEIDUNG_REST D19_BILDUNG D19_BIO_OEKO D19_BUCH_CD D19_DIGIT_SERV D19_DROGERIEARTIKEL D19_ENERGIE D19_FREIZEIT D19_GARTEN D19_GESAMT_ANZ_12 D19_GESAMT_ANZ_24 D19_GESAMT_DATUM D19_GESAMT_OFFLINE_DATUM D19_GESAMT_ONLINE_DATUM D19_GESAMT_ONLINE_QUOTE_12 D19_HANDWERK D19_HAUS_DEKO D19_KINDERARTIKEL D19_KONSUMTYP D19_KONSUMTYP_MAX D19_KOSMETIK ... LP_FAMILIE_GROB LP_LEBENSPHASE_FEIN LP_LEBENSPHASE_GROB LP_STATUS_FEIN LP_STATUS_GROB MIN_GEBAEUDEJAHR MOBI_RASTER MOBI_REGIO NATIONALITAET_KZ ONLINE_AFFINITAET ORTSGR_KLS9 OST_WEST_KZ PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_BAUMAX PLZ8_GBZ PLZ8_HHZ PRAEGENDE_JUGENDJAHRE REGIOTYP RELAT_AB RETOURTYP_BK_S RT_KEIN_ANREIZ RT_SCHNAEPPCHEN RT_UEBERGROESSE SEMIO_DOM SEMIO_ERL SEMIO_FAM SEMIO_KAEM SEMIO_KRIT SEMIO_KULT SEMIO_LUST SEMIO_MAT SEMIO_PFLICHT SEMIO_RAT SEMIO_REL SEMIO_SOZ SEMIO_TRADV SEMIO_VERT SHOPPER_TYP SOHO_KZ STRUKTURTYP TITEL_KZ UMFELD_ALT UMFELD_JUNG UNGLEICHENN_FLAG VERDICHTUNGSRAUM VERS_TYP VHA VHN VK_DHT4A VK_DISTANZ VK_ZG11 W_KEIT_KIND_HH WOHNDAUER_2008 WOHNLAGE ZABEOTYP ANREDE_KZ ALTERSKATEGORIE_GROB
0 910215 -1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2.0 5.0 1.0 1.0 5.0 5.0 5.0 5.0 0 0 10 0 0 0 10 10 NaN 0 0 0 0 0 0 0 0 0 0 0 0 0 10 10 10 NaN 0 0 0 NaN 9 0 ... 2.0 15.0 4.0 1.0 1.0 NaN NaN NaN 0 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN NaN 5.0 1.0 4.0 1.0 6 3 6 6 7 3 5 5 5 4 7 2 3 1 -1 NaN NaN NaN NaN NaN NaN NaN -1 NaN NaN NaN NaN NaN NaN NaN NaN 3 1 2
1 910220 -1 9.0 0.0 NaN NaN NaN NaN 21.0 11.0 0.0 0.0 2.0 12.0 0.0 3.0 6.0 8A 8 51 5.0 1.0 5.0 5.0 2.0 3.0 1.0 1.0 0 0 10 0 0 0 10 10 NaN 0 0 0 0 0 0 0 0 0 0 0 0 0 10 10 10 NaN 0 0 0 NaN 9 0 ... 3.0 21.0 6.0 2.0 1.0 1992.0 1.0 1.0 1 3.0 5.0 W 2.0 3.0 2.0 1.0 1.0 4.0 5.0 14 3.0 4.0 1.0 5.0 3.0 5.0 7 2 4 4 4 3 2 3 7 6 4 5 6 1 3 1.0 2.0 0.0 3.0 3.0 1.0 0.0 2 0.0 4.0 8.0 11.0 10.0 3.0 9.0 4.0 5 2 1
2 910225 -1 9.0 17.0 NaN NaN NaN NaN 17.0 10.0 0.0 0.0 1.0 7.0 0.0 3.0 2.0 4C 4 24 3.0 2.0 4.0 4.0 1.0 3.0 2.0 2.0 0 0 10 0 0 0 10 10 0.0 0 0 0 6 0 0 0 0 0 0 0 0 0 10 10 10 0.0 0 0 0 9.0 8 6 ... 1.0 3.0 1.0 3.0 2.0 1992.0 2.0 3.0 1 2.0 5.0 W 3.0 3.0 1.0 0.0 1.0 4.0 4.0 15 2.0 2.0 3.0 5.0 4.0 5.0 7 6 1 7 7 3 4 3 3 4 3 4 3 4 2 0.0 3.0 0.0 2.0 5.0 0.0 1.0 1 0.0 2.0 9.0 9.0 6.0 3.0 9.0 2.0 5 2 3
3 910226 2 1.0 13.0 NaN NaN NaN NaN 13.0 1.0 0.0 0.0 0.0 2.0 0.0 2.0 4.0 2A 2 12 2.0 3.0 2.0 2.0 4.0 4.0 5.0 3.0 0 0 10 0 0 0 10 10 0.0 0 0 0 0 0 6 0 0 0 0 0 0 0 10 10 10 0.0 0 0 0 9.0 8 0 ... 0.0 0.0 0.0 9.0 4.0 1997.0 4.0 4.0 1 1.0 3.0 W 2.0 2.0 2.0 0.0 1.0 4.0 3.0 8 0.0 3.0 2.0 3.0 2.0 3.0 4 7 1 5 4 4 4 1 4 3 2 5 4 4 1 0.0 1.0 0.0 4.0 5.0 0.0 0.0 1 1.0 0.0 7.0 10.0 11.0 NaN 9.0 7.0 3 2 4
4 910241 -1 1.0 20.0 NaN NaN NaN NaN 14.0 3.0 0.0 0.0 4.0 3.0 0.0 4.0 2.0 6B 6 43 5.0 3.0 3.0 3.0 3.0 4.0 3.0 3.0 3 5 5 1 2 0 10 5 10.0 6 6 1 6 0 6 0 1 5 0 0 6 6 1 6 1 10.0 0 5 0 1.0 1 0 ... 5.0 32.0 10.0 3.0 2.0 1992.0 1.0 3.0 1 5.0 6.0 W 2.0 4.0 2.0 1.0 2.0 3.0 3.0 8 5.0 5.0 5.0 3.0 5.0 5.0 2 4 4 2 3 6 4 2 4 2 4 6 2 7 2 0.0 3.0 0.0 4.0 3.0 0.0 1.0 2 0.0 2.0 3.0 5.0 4.0 2.0 9.0 3.0 4 1 3

5 rows × 366 columns

In [17]:
# looking into the customers dataset
print('Shape:', customers.shape)
customers.head()
Shape: (191652, 369)
Out[17]:
LNR AGER_TYP AKT_DAT_KL ALTER_HH ALTER_KIND1 ALTER_KIND2 ALTER_KIND3 ALTER_KIND4 ALTERSKATEGORIE_FEIN ANZ_HAUSHALTE_AKTIV ANZ_HH_TITEL ANZ_KINDER ANZ_PERSONEN ANZ_STATISTISCHE_HAUSHALTE ANZ_TITEL ARBEIT BALLRAUM CAMEO_DEU_2015 CAMEO_DEUG_2015 CAMEO_INTL_2015 CJT_GESAMTTYP CJT_KATALOGNUTZER CJT_TYP_1 CJT_TYP_2 CJT_TYP_3 CJT_TYP_4 CJT_TYP_5 CJT_TYP_6 D19_BANKEN_ANZ_12 D19_BANKEN_ANZ_24 D19_BANKEN_DATUM D19_BANKEN_DIREKT D19_BANKEN_GROSS D19_BANKEN_LOKAL D19_BANKEN_OFFLINE_DATUM D19_BANKEN_ONLINE_DATUM D19_BANKEN_ONLINE_QUOTE_12 D19_BANKEN_REST D19_BEKLEIDUNG_GEH D19_BEKLEIDUNG_REST D19_BILDUNG D19_BIO_OEKO D19_BUCH_CD D19_DIGIT_SERV D19_DROGERIEARTIKEL D19_ENERGIE D19_FREIZEIT D19_GARTEN D19_GESAMT_ANZ_12 D19_GESAMT_ANZ_24 D19_GESAMT_DATUM D19_GESAMT_OFFLINE_DATUM D19_GESAMT_ONLINE_DATUM D19_GESAMT_ONLINE_QUOTE_12 D19_HANDWERK D19_HAUS_DEKO D19_KINDERARTIKEL D19_KONSUMTYP D19_KONSUMTYP_MAX D19_KOSMETIK ... LP_STATUS_FEIN LP_STATUS_GROB MIN_GEBAEUDEJAHR MOBI_RASTER MOBI_REGIO NATIONALITAET_KZ ONLINE_AFFINITAET ORTSGR_KLS9 OST_WEST_KZ PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_BAUMAX PLZ8_GBZ PLZ8_HHZ PRAEGENDE_JUGENDJAHRE REGIOTYP RELAT_AB RETOURTYP_BK_S RT_KEIN_ANREIZ RT_SCHNAEPPCHEN RT_UEBERGROESSE SEMIO_DOM SEMIO_ERL SEMIO_FAM SEMIO_KAEM SEMIO_KRIT SEMIO_KULT SEMIO_LUST SEMIO_MAT SEMIO_PFLICHT SEMIO_RAT SEMIO_REL SEMIO_SOZ SEMIO_TRADV SEMIO_VERT SHOPPER_TYP SOHO_KZ STRUKTURTYP TITEL_KZ UMFELD_ALT UMFELD_JUNG UNGLEICHENN_FLAG VERDICHTUNGSRAUM VERS_TYP VHA VHN VK_DHT4A VK_DISTANZ VK_ZG11 W_KEIT_KIND_HH WOHNDAUER_2008 WOHNLAGE ZABEOTYP PRODUCT_GROUP CUSTOMER_GROUP ONLINE_PURCHASE ANREDE_KZ ALTERSKATEGORIE_GROB
0 9626 2 1.0 10.0 NaN NaN NaN NaN 10.0 1.0 0.0 0.0 2.0 1.0 0.0 1.0 3.0 1A 1 13 5.0 4.0 1.0 1.0 5.0 5.0 5.0 5.0 0 0 10 0 0 0 10 10 0.0 0 0 0 0 0 6 0 0 0 0 0 0 0 9 9 10 0.0 0 6 0 3.0 2 0 ... 10.0 5.0 1992.0 3.0 4.0 1 3.0 2.0 W 3.0 3.0 1.0 0.0 1.0 5.0 5.0 4 1.0 1.0 5.0 1.0 5.0 3.0 1 3 5 1 3 4 7 6 2 1 2 6 1 6 3 0.0 3.0 0.0 4.0 4.0 0.0 8.0 1 0.0 3.0 5.0 3.0 2.0 6.0 9.0 7.0 3 COSMETIC_AND_FOOD MULTI_BUYER 0 1 4
1 9628 -1 9.0 11.0 NaN NaN NaN NaN NaN NaN NaN 0.0 3.0 NaN 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0 1 6 0 5 0 10 10 0.0 6 0 0 0 0 0 0 6 0 0 0 0 1 6 10 9 0.0 0 0 0 5.0 3 0 ... NaN NaN NaN NaN NaN 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN NaN NaN NaN NaN NaN 3 3 6 2 3 4 5 6 4 1 2 3 1 7 3 0.0 NaN 0.0 NaN NaN 0.0 NaN 1 0.0 NaN 6.0 6.0 3.0 0.0 9.0 NaN 3 FOOD SINGLE_BUYER 0 1 4
2 143872 -1 1.0 6.0 NaN NaN NaN NaN 0.0 1.0 0.0 0.0 1.0 1.0 0.0 3.0 7.0 5D 5 34 2.0 5.0 2.0 2.0 5.0 5.0 5.0 5.0 0 0 10 0 0 0 10 10 0.0 0 0 0 6 0 0 0 0 0 0 0 0 0 10 10 10 0.0 0 0 0 3.0 2 0 ... 10.0 5.0 1992.0 1.0 3.0 1 1.0 5.0 W 2.0 3.0 3.0 1.0 3.0 2.0 3.0 4 7.0 3.0 5.0 1.0 5.0 1.0 5 7 2 6 7 1 7 3 4 2 1 2 1 3 1 0.0 3.0 0.0 1.0 5.0 0.0 0.0 2 0.0 4.0 10.0 13.0 11.0 6.0 9.0 2.0 3 COSMETIC_AND_FOOD MULTI_BUYER 0 2 4
3 143873 1 1.0 8.0 NaN NaN NaN NaN 8.0 0.0 NaN 0.0 0.0 1.0 0.0 1.0 7.0 4C 4 24 2.0 5.0 1.0 1.0 5.0 5.0 5.0 5.0 0 0 10 0 0 0 10 10 0.0 0 0 0 0 0 6 0 0 0 0 0 0 1 6 6 10 0.0 0 0 0 3.0 2 0 ... 9.0 4.0 1992.0 3.0 4.0 1 2.0 3.0 W 3.0 2.0 1.0 0.0 1.0 4.0 3.0 1 6.0 1.0 3.0 1.0 5.0 2.0 3 3 5 3 3 4 5 4 3 3 3 6 4 7 0 0.0 1.0 0.0 3.0 4.0 0.0 0.0 1 0.0 2.0 6.0 4.0 2.0 NaN 9.0 7.0 1 COSMETIC MULTI_BUYER 0 1 4
4 143874 -1 1.0 20.0 NaN NaN NaN NaN 14.0 7.0 0.0 0.0 4.0 7.0 0.0 3.0 3.0 7B 7 41 6.0 4.0 3.0 3.0 3.0 4.0 3.0 3.0 1 2 3 5 0 3 10 7 0.0 0 0 6 0 0 2 0 4 0 6 0 3 5 1 8 1 10.0 0 6 0 1.0 4 0 ... 1.0 1.0 1992.0 1.0 3.0 1 5.0 5.0 W 2.0 4.0 2.0 1.0 2.0 3.0 3.0 8 7.0 1.0 5.0 4.0 3.0 5.0 5 4 5 2 3 5 6 6 5 5 4 4 4 5 1 0.0 3.0 0.0 2.0 4.0 0.0 1.0 2 0.0 4.0 3.0 5.0 4.0 2.0 9.0 3.0 1 FOOD MULTI_BUYER 0 1 3

5 rows × 369 columns

First thing that we should ask ourselves is what do we know about the features?

Features dtypes

In [18]:
# general population dataset info
azdias.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891221 entries, 0 to 891220
Columns: 366 entries, LNR to ALTERSKATEGORIE_GROB
dtypes: float16(267), int8(4), object(6), uint16(1), uint32(1), uint8(87)
memory usage: 577.1+ MB
In [19]:
# customers dataset info
customers.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 191652 entries, 0 to 191651
Columns: 369 entries, LNR to ALTERSKATEGORIE_GROB
dtypes: float16(267), int8(4), object(8), uint16(1), uint32(1), uint8(88)
memory usage: 127.2+ MB

We can try to reduce the memory size of the datasets first to imporve their handling.

In [20]:
if not read_pickle:
    azdias_p, azdias_na = reduce_mem_usage(azdias)
In [21]:
if not read_pickle:
    customers_p, customers_na = reduce_mem_usage(customers)
In [22]:
# save datasets in pickle format for later usage
if not read_pickle:
    pd.to_pickle(azdias_p, azdias_pickle)
    pd.to_pickle(customers_p, customers_pickle)

Features in customers not in azdias

In [23]:
azdias_feats = set(azdias.columns)
customers_feats = set(customers.columns)
print("Features in customers not in azdias:")
print(customers_feats.difference(azdias_feats))
Features in customers not in azdias:
{'ONLINE_PURCHASE', 'CUSTOMER_GROUP', 'PRODUCT_GROUP'}

We are lucky enough to have some information provided about the features and their values in the datasets, so let's explore them.

In [24]:
dias_atts.head()
Out[24]:
Unnamed: 0 Information level Attribute Description Additional notes
0 NaN NaN AGER_TYP best-ager typology in cooperation with Kantar TNS; the information basis is a consumer survey
1 NaN Person ALTERSKATEGORIE_GROB age through prename analysis modelled on millions of first name-age-reference data
2 NaN NaN ANREDE_KZ gender NaN
3 NaN NaN CJT_GESAMTTYP Customer-Journey-Typology relating to the preferred information and buying channels of consumers relating to the preferred information, marketing and buying channels of consumers as well as their cross-channel usage. The information basis is a survey on the consumer channel preferences combined via a statistical modell with AZ DIAS data
4 NaN NaN FINANZ_MINIMALIST financial typology: low financial interest Gfk-Typology based on a representative household panel combined via a statistical modell with AZ DIAS data
In [25]:
dias_vals.head(10)
Out[25]:
Unnamed: 0 Attribute Description Value Meaning
0 NaN AGER_TYP best-ager typology -1 unknown
1 NaN NaN NaN 0 no classification possible
2 NaN NaN NaN 1 passive elderly
3 NaN NaN NaN 2 cultural elderly
4 NaN NaN NaN 3 experience-driven elderly
5 NaN ALTERSKATEGORIE_GROB age classification through prename analysis -1, 0 unknown
6 NaN NaN NaN 1 < 30 years
7 NaN NaN NaN 2 30 - 45 years
8 NaN NaN NaN 3 46 - 60 years
9 NaN NaN NaN 4 > 60 years

How many features are present in Values and Attributes?

In [26]:
print("Number of features in Values:", dias_vals.Attribute.dropna().nunique())
print("Number of features in Attributes:", dias_atts.Attribute.dropna().nunique())
Number of features in Values: 314
Number of features in Attributes: 313

Are there any features in either of them that isn't included in the other?

In [27]:
dias_vals_feats = set(dias_vals.Attribute.dropna())
dias_atts_feats = set(dias_atts.Attribute.dropna())

print("Features in Values not in Attributes:")
for feat in dias_vals_feats.difference(dias_atts_feats):
    print(feat)
    
print("\nFeatures in Attributes not in Values:")
for feat in dias_atts_feats.difference(dias_vals_feats):
    print(feat)
Features in Values not in Attributes:
D19_VERSI_ANZ_12
D19_GESAMT_ANZ_24
BIP_FLAG
KBA13_CCM_3000
D19_VERSAND_ANZ_24
D19_TELKO_ANZ_24
D19_TELKO_ANZ_12
D19_VERSI_ANZ_24
D19_LOTTO_RZ
D19_GESAMT_ANZ_12
D19_BANKEN_ANZ_24
D19_BANKEN_ANZ_12
KBA13_CCM_3001
D19_VERSAND_ANZ_12

Features in Attributes not in Values:
GKZ
D19_VERSI_ ANZ_12                                       D19_VERSI_ ANZ_24
ARBEIT
PLZ
PLZ8
D19_TELKO_ ANZ_12                  D19_TELKO_ ANZ_24
D19_GESAMT_ANZ_12                                    D19_GESAMT_ANZ_24
D19_VERSI_ONLINE_DATUM
D19_VERSAND_ ANZ_12          D19_VERSAND_ ANZ_24
D19_VERSI_DATUM
EINWOHNER
D19_BANKEN_ ANZ_12             D19_BANKEN_ ANZ_24
D19_VERSI_OFFLINE_DATUM

Looks like there are some values that need to be cleaned in the Attributes excel sheet, so let's clean them and answer the question.

In [28]:
# set of features that have wrong format
wrong_feats = dias_atts_feats.difference(dias_vals_feats)

dias_atts.query("Attribute in @wrong_feats")
Out[28]:
Unnamed: 0 Information level Attribute Description Additional notes
50 NaN NaN D19_GESAMT_ANZ_12 D19_GESAMT_ANZ_24 transaction activity TOTAL POOL in the last 12 and 24 months NaN
51 NaN NaN D19_BANKEN_ ANZ_12 D19_BANKEN_ ANZ_24 transaction activity BANKS in the last 12 and 24 months NaN
52 NaN NaN D19_TELKO_ ANZ_12 D19_TELKO_ ANZ_24 transaction activity TELCO in the last 12 and 24 months NaN
53 NaN NaN D19_VERSI_ ANZ_12 D19_VERSI_ ANZ_24 transaction activity INSURANCE in the last 12 and 24 months NaN
54 NaN NaN D19_VERSAND_ ANZ_12 D19_VERSAND_ ANZ_24 transaction activity MAIL-ORDER in the last 12 and 24 months NaN
67 NaN NaN D19_VERSI_OFFLINE_DATUM actuality of the last transaction for the segment insurance OFFLINE NaN
68 NaN NaN D19_VERSI_ONLINE_DATUM actuality of the last transaction for the segment insurance ONLINE NaN
69 NaN NaN D19_VERSI_DATUM actuality of the last transaction for the segment insurance TOTAL NaN
188 NaN NaN PLZ postcode NaN
300 NaN NaN PLZ8 sub-postcode (about 8 PLZ8 make up one PLZ) and \nnew macrocell level (about 500 households) NaN
308 NaN Community ARBEIT share of unemployed person in the community NaN
309 NaN NaN EINWOHNER inhabitants NaN
310 NaN NaN GKZ standardized community-code NaN

It turns out that these features weren't wrongly formatted, but they are features that are probably not in azdias and customers dataset. Let's check this now.

In [29]:
print("Features in Attributes not in azdias:")
for feat in dias_vals_feats.difference(azdias_feats):
    print(feat)
    
print("\nFeatures in Values not in azdias:")
for feat in dias_atts_feats.difference(azdias_feats):
    print(feat)
Features in Attributes not in azdias:
D19_KINDERARTIKEL_RZ
D19_VERSICHERUNGEN_RZ
CAMEO_DEUINTL_2015
D19_TELKO_MOBILE_RZ
SOHO_FLAG
D19_KK_KUNDENTYP
D19_TECHNIK_RZ
D19_LOTTO_RZ
D19_SCHUHE_RZ
HAUSHALTSSTRUKTUR
D19_GARTEN_RZ
D19_BEKLEIDUNG_REST_RZ
D19_BANKEN_GROSS_RZ
D19_BEKLEIDUNG_GEH_RZ
D19_BANKEN_LOKAL_RZ
D19_BANKEN_REST_RZ
D19_NAHRUNGSERGAENZUNG_RZ
D19_VOLLSORTIMENT_RZ
D19_SAMMELARTIKEL_RZ
D19_TELKO_REST_RZ
D19_KOSMETIK_RZ
D19_SONSTIGE_RZ
D19_DROGERIEARTIKEL_RZ
WACHSTUMSGEBIET_NB
D19_ENERGIE_RZ
D19_HAUS_DEKO_RZ
D19_DIGIT_SERV_RZ
D19_RATGEBER_RZ
GEOSCORE_KLS7
D19_FREIZEIT_RZ
D19_VERSAND_REST_RZ
BIP_FLAG
KBA13_CCM_1400_2500
D19_HANDWERK_RZ
D19_BUCH_RZ
D19_BILDUNG_RZ
D19_TIERARTIKEL_RZ
D19_WEIN_FEINKOST_RZ
D19_REISEN_RZ
D19_BIO_OEKO_RZ
D19_LEBENSMITTEL_RZ
D19_BANKEN_DIREKT_RZ

Features in Values not in azdias:
D19_KINDERARTIKEL_RZ
D19_VERSICHERUNGEN_RZ
D19_VERSI_ ANZ_12                                       D19_VERSI_ ANZ_24
CAMEO_DEUINTL_2015
D19_TELKO_MOBILE_RZ
SOHO_FLAG
D19_KK_KUNDENTYP
D19_TECHNIK_RZ
D19_SCHUHE_RZ
D19_VERSAND_ ANZ_12          D19_VERSAND_ ANZ_24
HAUSHALTSSTRUKTUR
D19_GARTEN_RZ
D19_BEKLEIDUNG_REST_RZ
D19_BANKEN_GROSS_RZ
D19_BANKEN_ ANZ_12             D19_BANKEN_ ANZ_24
D19_BEKLEIDUNG_GEH_RZ
D19_BANKEN_LOKAL_RZ
D19_BANKEN_REST_RZ
D19_NAHRUNGSERGAENZUNG_RZ
D19_GESAMT_ANZ_12                                    D19_GESAMT_ANZ_24
D19_VOLLSORTIMENT_RZ
D19_SAMMELARTIKEL_RZ
D19_TELKO_REST_RZ
D19_KOSMETIK_RZ
D19_SONSTIGE_RZ
D19_DROGERIEARTIKEL_RZ
WACHSTUMSGEBIET_NB
GKZ
D19_LEBENSMITTEL_RZ
D19_ENERGIE_RZ
PLZ
PLZ8
D19_TELKO_ ANZ_12                  D19_TELKO_ ANZ_24
D19_HAUS_DEKO_RZ
D19_DIGIT_SERV_RZ
D19_RATGEBER_RZ
GEOSCORE_KLS7
D19_FREIZEIT_RZ
D19_VERSAND_REST_RZ
KBA13_CCM_1400_2500
D19_HANDWERK_RZ
D19_BUCH_RZ
D19_BILDUNG_RZ
D19_TIERARTIKEL_RZ
D19_WEIN_FEINKOST_RZ
D19_REISEN_RZ
D19_BIO_OEKO_RZ
EINWOHNER
D19_BANKEN_DIREKT_RZ

It looks that these information sheets have features that aren't included in our datasets. So we better ask now how many features can we get information about in our datasets from these two information sheets?

In [30]:
vals_in_azdias = dias_vals_feats.intersection(azdias_feats)
atts_in_azdias = dias_atts_feats.intersection(azdias_feats)

print("Number of featuers in Azdias:", azdias.shape[1])
print("Number of features shared between Azdias and Values:", len(vals_in_azdias))
print("Number of features shared between Azdias and Attributes:", len(atts_in_azdias))
Number of featuers in Azdias: 366
Number of features shared between Azdias and Values: 272
Number of features shared between Azdias and Attributes: 264

What are the features not present in Values or Attributes?

In [31]:
print("Features in Azdias not in Values or Attributes:")
lost_feats = sorted(list(set(azdias_feats).difference(atts_in_azdias)))
for feat in lost_feats:
    print(feat)
Features in Azdias not in Values or Attributes:
AKT_DAT_KL
ALTERSKATEGORIE_FEIN
ALTER_KIND1
ALTER_KIND2
ALTER_KIND3
ALTER_KIND4
ANZ_KINDER
ANZ_STATISTISCHE_HAUSHALTE
CAMEO_INTL_2015
CJT_KATALOGNUTZER
CJT_TYP_1
CJT_TYP_2
CJT_TYP_3
CJT_TYP_4
CJT_TYP_5
CJT_TYP_6
D19_BANKEN_ANZ_12
D19_BANKEN_ANZ_24
D19_BANKEN_DIREKT
D19_BANKEN_GROSS
D19_BANKEN_LOKAL
D19_BANKEN_REST
D19_BEKLEIDUNG_GEH
D19_BEKLEIDUNG_REST
D19_BILDUNG
D19_BIO_OEKO
D19_BUCH_CD
D19_DIGIT_SERV
D19_DROGERIEARTIKEL
D19_ENERGIE
D19_FREIZEIT
D19_GARTEN
D19_GESAMT_ANZ_12
D19_GESAMT_ANZ_24
D19_HANDWERK
D19_HAUS_DEKO
D19_KINDERARTIKEL
D19_KONSUMTYP_MAX
D19_KOSMETIK
D19_LEBENSMITTEL
D19_LETZTER_KAUF_BRANCHE
D19_LOTTO
D19_NAHRUNGSERGAENZUNG
D19_RATGEBER
D19_REISEN
D19_SAMMELARTIKEL
D19_SCHUHE
D19_SONSTIGE
D19_SOZIALES
D19_TECHNIK
D19_TELKO_ANZ_12
D19_TELKO_ANZ_24
D19_TELKO_MOBILE
D19_TELKO_ONLINE_QUOTE_12
D19_TELKO_REST
D19_TIERARTIKEL
D19_VERSAND_ANZ_12
D19_VERSAND_ANZ_24
D19_VERSAND_REST
D19_VERSICHERUNGEN
D19_VERSI_ANZ_12
D19_VERSI_ANZ_24
D19_VERSI_ONLINE_QUOTE_12
D19_VOLLSORTIMENT
D19_WEIN_FEINKOST
DSL_FLAG
EINGEFUEGT_AM
EINGEZOGENAM_HH_JAHR
EXTSEL992
FIRMENDICHTE
GEMEINDETYP
HH_DELTA_FLAG
KBA13_ANTG1
KBA13_ANTG2
KBA13_ANTG3
KBA13_ANTG4
KBA13_BAUMAX
KBA13_CCM_1401_2500
KBA13_CCM_3000
KBA13_CCM_3001
KBA13_GBZ
KBA13_HHZ
KBA13_KMH_210
KK_KUNDENTYP
KOMBIALTER
KONSUMZELLE
LNR
MOBI_RASTER
RT_KEIN_ANREIZ
RT_SCHNAEPPCHEN
RT_UEBERGROESSE
SOHO_KZ
STRUKTURTYP
UMFELD_ALT
UMFELD_JUNG
UNGLEICHENN_FLAG
VERDICHTUNGSRAUM
VHA
VHN
VK_DHT4A
VK_DISTANZ
VK_ZG11

What are the features not shared between Values and Attributes?

In [32]:
print("Features in Values not Attributes:")
for feat in vals_in_azdias.difference(atts_in_azdias):
    print(feat)
Features in Values not Attributes:
D19_VERSI_ANZ_12
D19_GESAMT_ANZ_24
KBA13_CCM_3000
D19_TELKO_ANZ_24
D19_VERSAND_ANZ_24
D19_TELKO_ANZ_12
D19_VERSI_ANZ_24
D19_GESAMT_ANZ_12
D19_BANKEN_ANZ_24
D19_BANKEN_ANZ_12
KBA13_CCM_3001
D19_VERSAND_ANZ_12

We can see that most of these features in difference between Values and Attributes are D19 features, which have their majority missing between Azdias and Values.

What is the distribution of Null values in the columns of both datasets?

In [33]:
azdias_null_cols = get_null_prop(azdias, plot=False)
customers_null_cols = get_null_prop(customers, plot=False)
In [34]:
azdias_null_cols.hist(bins=10, alpha=0.7, label='AZDIAS')
customers_null_cols.hist(bins=10, alpha=0.7, label='CUSTOMERS')
plt.title("Distribution of Null Values in Columns");
plt.legend()
plt.tight_layout()
plt.savefig("null_columns.png");
In [35]:
azdias_null_rows = get_null_prop(azdias, 1, False)
customers_null_rows = get_null_prop(customers, 1, False)
In [36]:
azdias_null_rows.hist(bins=10, alpha=0.7, label='AZDIAS')
customers_null_rows.hist(bins=10, alpha=0.7, label='CUSTOMERS')
plt.title("Distribution of Null Values in Rows");
plt.legend()
plt.tight_layout()
plt.savefig("null_rows.png");

From the above graphs we can see that the null percentages are definetly higher in Azdias dataset, and since we will be modeling the clustering algorithm on Azdias, we will focus our exploration on it, and only look into Customer if needed.

And since we touched on missing values, we can notice in the Values sheet that there are unknown values present in each feature, but at the same time we have null values.

How does the missing values distribution change if we encode unknown values as null?

In [37]:
# find the unknown values associated with each feature
feat_unknown_vals = dias_vals.query("Meaning == 'unknown'")
feat_unknown_vals.head()
Out[37]:
Unnamed: 0 Attribute Description Value Meaning
0 NaN AGER_TYP best-ager typology -1 unknown
5 NaN ALTERSKATEGORIE_GROB age classification through prename analysis -1, 0 unknown
33 NaN ANREDE_KZ gender -1, 0 unknown
40 NaN BALLRAUM distance to next urban centre -1 unknown
48 NaN BIP_FLAG business-flag indicating companies in the building -1 unknown
In [38]:
# replace unknown values with null
azdias_new = replace_unknown_with_null(azdias)
In [39]:
azdias_new_null_cols = get_null_prop(azdias_new, plot=False)
azdias_new_null_rows = get_null_prop(azdias_new, axis=1, plot=False)
In [40]:
azdias_new_null_cols.hist(bins=20, alpha=0.7, label='New AZDIAS')
azdias_null_cols.hist(bins=20, alpha=0.7, label='Old AZDIAS')
plt.title("Distribution of Null Values in Columns");
plt.legend()
plt.tight_layout()
plt.savefig("null_columns_azdias.png");
In [41]:
azdias_new_null_rows.hist(bins=20, alpha=0.7, label='New AZDIAS')
azdias_null_rows.hist(bins=20, alpha=0.7, label='Old AZDIAS')
plt.title("Distribution of Null Values in Rows");
plt.legend()
plt.tight_layout()
plt.savefig("null_rows_azdias.png");

After replacing the unknown values with null, we can see more features having higher percentage of null values than before.

The question now is how to deal with these missing values?

This will depend completely on the type of feature we are dealing with. As missing values can be divided into 3 categories:

  1. Missing completely at random (MCAR): where the absence of such data is completely unrelated to other observed data and unobserved data, which means that there is no pattern to the missing data.
  2. Missing at random (MAR): where the absence of such data is related to other observed data but not unobserved data, which means that there is a pattern to the missing data.
  3. Missing not at random: where the missing data is related to unobserved data and it signifies something, like a column about age of first child while the person related to the data point doesn't have a child.

The best way is to skim the features with their description, value counts and percentage of null values. We have a lot of features, and fortunately they are divided into categories. So we can check the features of each category and take notes along the way.

How many feature categories are there?

We need to take a look into the Attributes dataframe first.

In [42]:
dias_atts.head()
Out[42]:
Unnamed: 0 Information level Attribute Description Additional notes
0 NaN NaN AGER_TYP best-ager typology in cooperation with Kantar TNS; the information basis is a consumer survey
1 NaN Person ALTERSKATEGORIE_GROB age through prename analysis modelled on millions of first name-age-reference data
2 NaN NaN ANREDE_KZ gender NaN
3 NaN NaN CJT_GESAMTTYP Customer-Journey-Typology relating to the preferred information and buying channels of consumers relating to the preferred information, marketing and buying channels of consumers as well as their cross-channel usage. The information basis is a survey on the consumer channel preferences combined via a statistical modell with AZ DIAS data
4 NaN NaN FINANZ_MINIMALIST financial typology: low financial interest Gfk-Typology based on a representative household panel combined via a statistical modell with AZ DIAS data

Information level is the feature that has the category, but there are null values as this was a spreadsheet with merged cells. So we need to forwardfill, then backfill the first null value.

In [43]:
dias_atts["Information level"] = dias_atts["Information level"].fillna(method="ffill", axis=0)
dias_atts["Information level"] = dias_atts["Information level"].fillna(method="bfill", axis=0)
dias_atts.head()
Out[43]:
Unnamed: 0 Information level Attribute Description Additional notes
0 NaN Person AGER_TYP best-ager typology in cooperation with Kantar TNS; the information basis is a consumer survey
1 NaN Person ALTERSKATEGORIE_GROB age through prename analysis modelled on millions of first name-age-reference data
2 NaN Person ANREDE_KZ gender NaN
3 NaN Person CJT_GESAMTTYP Customer-Journey-Typology relating to the preferred information and buying channels of consumers relating to the preferred information, marketing and buying channels of consumers as well as their cross-channel usage. The information basis is a survey on the consumer channel preferences combined via a statistical modell with AZ DIAS data
4 NaN Person FINANZ_MINIMALIST financial typology: low financial interest Gfk-Typology based on a representative household panel combined via a statistical modell with AZ DIAS data
In [44]:
# remove features that aren't in Azdias
dias_atts = dias_atts.query("Attribute in @azdias_feats")
In [45]:
print("Number of feature categories:", dias_atts["Information level"].nunique())
print("\nFeature Categories:")
print(dias_atts["Information level"].value_counts())
Number of feature categories: 9

Feature Categories:
PLZ8                  112
Microcell (RR3_ID)     54
Person                 42
Household              25
Microcell (RR4_ID)     11
Building                9
RR1_ID                  5
Community               3
Postcode                3
Name: Information level, dtype: int64
In [46]:
def explore_category(category, azdias):
    """Prints description, null percentage and value counts of each feature in a specified category."""
    # query only features in category
    cat_feats = dias_atts[dias_atts["Information level"] == category]
    # calcualte null percentage of each features
    cat_feats["null_percentage"] = cat_feats["Attribute"].apply(lambda feat: azdias[feat].isna().sum()/azdias.shape[0])
    # sort by null percentage
    cat_feats = cat_feats.sort_values("null_percentage")
    
    print(f"Number of features in {category}: {len(cat_feats)}\n\n")
    for i, row in cat_feats.iterrows():
        feat = row["Attribute"]
        print(feat)
        print(row["Description"])
        print("Null percentage:", row["null_percentage"])
        print(azdias[feat].value_counts())
        print()

PLZ8 Features

According to https://datarade.ai/data-products/plz8-germany-and-plz8-germany-xxl, Germany as been divided into 84,000 PLZ8 boundaries, so this category contains socio-economic data related to each PLZ8 boundary which helps in optimizing distribution of promotional materials, and in our case the mail-order sales.

In [47]:
explore_category("PLZ8", azdias_new)
/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
Number of features in PLZ8: 112


KBA13_ALTERHALTER_30
share of car owners below 31 within the PLZ8
Null percentage: 0.11871354018812394
3.0    333405
2.0    160653
4.0    147128
1.0     72911
5.0     71324
Name: KBA13_ALTERHALTER_30, dtype: int64

KBA13_OPEL
share of OPEL within the PLZ8
Null percentage: 0.11871354018812394
3.0    327618
2.0    164244
4.0    154681
1.0     72559
5.0     66319
Name: KBA13_OPEL, dtype: int64

KBA13_NISSAN
share of NISSAN within the PLZ8
Null percentage: 0.11871354018812394
3.0    335457
4.0    167124
2.0    160213
5.0     71427
1.0     51200
Name: KBA13_NISSAN, dtype: int64

KBA13_MOTOR
most common motor size within the PLZ8
Null percentage: 0.11871354018812394
3.0    474886
2.0    144655
4.0    102786
1.0     63094
Name: KBA13_MOTOR, dtype: int64

KBA13_MERCEDES
share of MERCEDES within the PLZ8
Null percentage: 0.11871354018812394
3.0    340379
4.0    178995
2.0    134513
5.0     82948
1.0     48586
Name: KBA13_MERCEDES, dtype: int64

KBA13_MAZDA
share of MAZDA within the PLZ8
Null percentage: 0.11871354018812394
3.0    343060
4.0    169148
2.0    156989
5.0     71832
1.0     44392
Name: KBA13_MAZDA, dtype: int64

KBA13_KW_121
share of cars with an engine power of more than 121 KW - PLZ8
Null percentage: 0.11871354018812394
3.0    282845
2.0    135033
1.0    105706
0.0     95195
4.0     88983
5.0     77659
Name: KBA13_KW_121, dtype: int64

KBA13_KW_120
share of cars with an engine power between 111 and 120 KW - PLZ8
Null percentage: 0.11871354018812394
3.0    281415
1.0    226709
4.0     94774
0.0     85028
5.0     73739
2.0     23756
Name: KBA13_KW_120, dtype: int64

KBA13_KW_110
share of cars with an engine power between 91 and 110 KW - PLZ8
Null percentage: 0.11871354018812394
3.0    275679
2.0    175418
0.0    124216
4.0     83780
1.0     63943
5.0     62385
Name: KBA13_KW_110, dtype: int64

KBA13_KW_90
share of cars with an engine power between 81 and 90 KW - PLZ8
Null percentage: 0.11871354018812394
3.0    277407
2.0    181685
0.0    133326
4.0     82747
5.0     58683
1.0     51573
Name: KBA13_KW_90, dtype: int64

KBA13_KW_80
share of cars with an engine power between 71 and 80 KW - PLZ8
Null percentage: 0.11871354018812394
3.0    269221
2.0    180524
0.0    129268
4.0     77562
1.0     77214
5.0     51632
Name: KBA13_KW_80, dtype: int64

KBA13_PEUGEOT
share of PEUGEOT within the PLZ8
Null percentage: 0.11871354018812394
3.0    340805
4.0    170378
2.0    154373
5.0     70008
1.0     49857
Name: KBA13_PEUGEOT, dtype: int64

KBA13_KW_61_120
share of cars with an engine power between 61 and 120 KW - PLZ8
Null percentage: 0.11871354018812394
3.0    360881
2.0    164699
4.0    161381
5.0     49311
1.0     49149
Name: KBA13_KW_61_120, dtype: int64

KBA13_KW_0_60
share of cars with less than 61 KW engine power - PLZ8
Null percentage: 0.11871354018812394
3.0    357321
2.0    165557
4.0    159703
1.0     54326
5.0     48514
Name: KBA13_KW_0_60, dtype: int64

KBA13_KW_60
share of cars with an engine power between 51 and 60 KW - PLZ8
Null percentage: 0.11871354018812394
3.0    267832
2.0    178524
0.0    132144
1.0     79264
4.0     78418
5.0     49239
Name: KBA13_KW_60, dtype: int64

KBA13_KW_50
share of cars with an engine power between 41 and 50 KW - PLZ8
Null percentage: 0.11871354018812394
3.0    273920
2.0    181545
0.0    143215
4.0     81007
5.0     55954
1.0     49780
Name: KBA13_KW_50, dtype: int64

KBA13_KW_40
share of cars with an engine power between 31 and 40 KW - PLZ8
Null percentage: 0.11871354018812394
3.0    283084
2.0    135456
1.0    121060
0.0     99260
4.0     85040
5.0     61521
Name: KBA13_KW_40, dtype: int64

KBA13_KW_30
share of cars up to 30 KW engine power - PLZ8
Null percentage: 0.11871354018812394
1.0    554887
2.0    142986
3.0     87548
Name: KBA13_KW_30, dtype: int64

KBA13_KRSZUL_NEU
share of newbuilt cars (referred to the county average) - PLZ8
Null percentage: 0.11871354018812394
2.0    378304
1.0    222157
3.0    153623
0.0     31337
Name: KBA13_KRSZUL_NEU, dtype: int64

KBA13_KRSSEG_VAN
share of vans (referred to the county average) - PLZ8
Null percentage: 0.11871354018812394
2.0    487626
1.0    169327
3.0    127767
0.0       701
Name: KBA13_KRSSEG_VAN, dtype: int64

KBA13_KRSSEG_OBER
share of upper class cars (referred to the county average) - PLZ8
Null percentage: 0.11871354018812394
2.0    516676
1.0    151665
3.0    116737
0.0       343
Name: KBA13_KRSSEG_OBER, dtype: int64

KBA13_KRSSEG_KLEIN
share of small cars (referred to the county average) - PLZ8
Null percentage: 0.11871354018812394
2.0    718366
1.0     35557
3.0     31418
0.0        80
Name: KBA13_KRSSEG_KLEIN, dtype: int64

KBA13_KRSHERST_FORD_OPEL
share of FORD/Opel (referred to the county average) - PLZ8
Null percentage: 0.11871354018812394
3.0    329240
4.0    166035
2.0    158597
1.0     65801
5.0     65690
0.0        58
Name: KBA13_KRSHERST_FORD_OPEL, dtype: int64

KBA13_KW_70
share of cars with an engine power between 61 and 70 KW - PLZ8
Null percentage: 0.11871354018812394
3.0    276717
2.0    184387
0.0    141548
4.0     78915
5.0     54143
1.0     49711
Name: KBA13_KW_70, dtype: int64

KBA13_RENAULT
share of RENAULT within the PLZ8
Null percentage: 0.11871354018812394
3.0    336384
4.0    163307
2.0    160781
5.0     70384
1.0     54565
Name: KBA13_RENAULT, dtype: int64

KBA13_SEG_GELAENDEWAGEN
share of allterrain within the PLZ8
Null percentage: 0.11871354018812394
3.0    345193
2.0    178645
4.0    143742
1.0     67786
5.0     50055
Name: KBA13_SEG_GELAENDEWAGEN, dtype: int64

KBA13_SEG_GROSSRAUMVANS
share of big sized vans within the PLZ8
Null percentage: 0.11871354018812394
3.0    340642
4.0    169944
2.0    152717
5.0     71749
1.0     50369
Name: KBA13_SEG_GROSSRAUMVANS, dtype: int64

KBA13_VW
share of VOLKSWAGEN within the PLZ8
Null percentage: 0.11871354018812394
3.0    336588
2.0    171175
4.0    149018
1.0     71506
5.0     57134
Name: KBA13_VW, dtype: int64

KBA13_VORB_3
share of cars with more than 2 preowner - PLZ8
Null percentage: 0.11871354018812394
3.0    264685
2.0    177458
0.0    143284
4.0     81176
5.0     64131
1.0     54687
Name: KBA13_VORB_3, dtype: int64

KBA13_VORB_2
share of cars with 2 preowner - PLZ8
Null percentage: 0.11871354018812394
3.0    363866
2.0    166317
4.0    162515
5.0     49491
1.0     43232
Name: KBA13_VORB_2, dtype: int64

KBA13_VORB_1_2
share of cars with 1 or 2 preowner - PLZ8
Null percentage: 0.11871354018812394
3.0    359262
2.0    173047
4.0    151120
1.0     61834
5.0     40158
Name: KBA13_VORB_1_2, dtype: int64

KBA13_VORB_1
share of cars with 1 preowner - PLZ8
Null percentage: 0.11871354018812394
3.0    361449
2.0    167076
4.0    158150
1.0     50939
5.0     47807
Name: KBA13_VORB_1, dtype: int64

KBA13_VORB_0
share of cars with no preowner - PLZ8
Null percentage: 0.11871354018812394
3.0    349837
4.0    174276
2.0    153752
5.0     71730
1.0     35826
Name: KBA13_VORB_0, dtype: int64

KBA13_TOYOTA
share of TOYOTA within the PLZ8
Null percentage: 0.11871354018812394
3.0    343193
4.0    166426
2.0    156050
5.0     72002
1.0     47750
Name: KBA13_TOYOTA, dtype: int64

KBA13_SITZE_6
number of cars with more than 5 seats in the PLZ8
Null percentage: 0.11871354018812394
3.0    336224
4.0    166528
2.0    147292
5.0     76974
1.0     58403
Name: KBA13_SITZE_6, dtype: int64

KBA13_SITZE_5
number of cars with 5 seats in the PLZ8
Null percentage: 0.11871354018812394
3.0    330228
2.0    179693
4.0    128787
1.0     91495
5.0     55218
Name: KBA13_SITZE_5, dtype: int64

KBA13_SITZE_4
number of cars with less than 5 seats in the PLZ8
Null percentage: 0.11871354018812394
3.0    328454
4.0    181695
2.0    129303
5.0     93443
1.0     52526
Name: KBA13_SITZE_4, dtype: int64

KBA13_SEG_WOHNMOBILE
share of roadmobiles within the PLZ8
Null percentage: 0.11871354018812394
3.0    269140
2.0    165205
1.0     95326
4.0     88952
0.0     85796
5.0     81002
Name: KBA13_SEG_WOHNMOBILE, dtype: int64

KBA13_SEG_VAN
share of vans within the PLZ8
Null percentage: 0.11871354018812394
3.0    341438
4.0    165710
2.0    157763
5.0     67515
1.0     52995
Name: KBA13_SEG_VAN, dtype: int64

KBA13_SEG_UTILITIES
share of MUVs/SUVs within the PLZ8
Null percentage: 0.11871354018812394
3.0    346456
2.0    164012
4.0    160080
5.0     61579
1.0     53294
Name: KBA13_SEG_UTILITIES, dtype: int64

KBA13_SEG_SPORTWAGEN
share of sportscars within the PLZ8
Null percentage: 0.11871354018812394
3.0    267922
2.0    146930
1.0    106267
4.0     92168
5.0     88713
0.0     83421
Name: KBA13_SEG_SPORTWAGEN, dtype: int64

KBA13_SEG_SONSTIGE
share of other cars within the PLZ8
Null percentage: 0.11871354018812394
3.0    352268
2.0    167674
4.0    165534
5.0     64535
1.0     35410
Name: KBA13_SEG_SONSTIGE, dtype: int64

KBA13_SEG_OBERKLASSE
share of upper class cars (BMW 7er etc.) in the PLZ8
Null percentage: 0.11871354018812394
3.0    283488
1.0    157682
4.0     91369
0.0     86270
5.0     84648
2.0     81964
Name: KBA13_SEG_OBERKLASSE, dtype: int64

KBA13_SEG_OBEREMITTELKLASSE
share of upper middle class cars and upper class cars (BMW5er, BMW7er etc.)
Null percentage: 0.11871354018812394
3.0    342284
4.0    184285
2.0    132852
5.0     81830
1.0     44170
Name: KBA13_SEG_OBEREMITTELKLASSE, dtype: int64

KBA13_SEG_MITTELKLASSE
share of middle class cars (Ford Mondeo etc.) in the PLZ8
Null percentage: 0.11871354018812394
3.0    337241
4.0    164192
2.0    156862
5.0     73287
1.0     53839
Name: KBA13_SEG_MITTELKLASSE, dtype: int64

KBA13_SEG_MINIWAGEN
share of minicars within the PLZ8
Null percentage: 0.11871354018812394
3.0    339598
4.0    176093
2.0    146739
5.0     77150
1.0     45841
Name: KBA13_SEG_MINIWAGEN, dtype: int64

KBA13_SEG_MINIVANS
share of minivans within the PLZ8
Null percentage: 0.11871354018812394
3.0    341862
2.0    161436
4.0    160044
5.0     65130
1.0     56949
Name: KBA13_SEG_MINIVANS, dtype: int64

KBA13_SEG_KOMPAKTKLASSE
share of lowe midclass cars (Ford Focus etc.) in the PLZ8
Null percentage: 0.11871354018812394
3.0    344398
2.0    173314
4.0    142952
1.0     64602
5.0     60155
Name: KBA13_SEG_KOMPAKTKLASSE, dtype: int64

KBA13_SEG_KLEINWAGEN
share of small and very small cars (Ford Fiesta, Ford Ka etc.) in the PLZ8
Null percentage: 0.11871354018812394
3.0    341514
2.0    167806
4.0    152870
1.0     68565
5.0     54666
Name: KBA13_SEG_KLEINWAGEN, dtype: int64

KBA13_SEG_KLEINST
share of very small cars (Ford Ka etc.) in the PLZ8
Null percentage: 0.11871354018812394
3.0    337591
2.0    161954
4.0    158570
1.0     64938
5.0     62368
Name: KBA13_SEG_KLEINST, dtype: int64

KBA13_KRSHERST_AUDI_VW
share of Volkswagen (referred to the county average) - PLZ8
Null percentage: 0.11871354018812394
3.0    337010
2.0    167559
4.0    162124
1.0     65798
5.0     52872
0.0        58
Name: KBA13_KRSHERST_AUDI_VW, dtype: int64

KBA13_KRSAQUOT
share of cars per household (referred to the county average) - PLZ8
Null percentage: 0.11871354018812394
3.0    324094
2.0    172621
4.0    142200
1.0     91812
5.0     54634
0.0        60
Name: KBA13_KRSAQUOT, dtype: int64

KBA13_KRSHERST_BMW_BENZ
share of BMW/Mercedes Benz (referred to the county average) - PLZ8
Null percentage: 0.11871354018812394
3.0    345264
4.0    165044
2.0    153245
5.0     74315
1.0     47495
0.0        58
Name: KBA13_KRSHERST_BMW_BENZ, dtype: int64

KBA13_KMH_250
share of cars with max speed between 210 and 250 km/h within the PLZ8
Null percentage: 0.11871354018812394
3.0    278290
2.0    161652
0.0    139756
4.0     88055
5.0     75224
1.0     42444
Name: KBA13_KMH_250, dtype: int64

KBA13_KMH_251
share of cars with a greater max speed than 250 km/h within the PLZ8
Null percentage: 0.11871354018812394
1.0    674722
3.0    100548
2.0     10151
Name: KBA13_KMH_251, dtype: int64

KBA13_CCM_2500
share of cars with 2000ccm to 2499ccm within the PLZ8
Null percentage: 0.11871354018812394
3.0    283932
2.0    144530
1.0    102470
0.0     95350
4.0     88885
5.0     70254
Name: KBA13_CCM_2500, dtype: int64

KBA13_CCM_2000
share of cars with 1800ccm to 1999ccm within the PLZ8
Null percentage: 0.11871354018812394
3.0    363131
4.0    170342
2.0    160922
5.0     57234
1.0     33792
Name: KBA13_CCM_2000, dtype: int64

KBA13_CCM_1800
share of cars with 1600ccm to 1799ccm within the PLZ8
Null percentage: 0.11871354018812394
3.0    276578
2.0    179863
0.0    137534
4.0     81550
5.0     57795
1.0     52101
Name: KBA13_CCM_1800, dtype: int64

KBA13_CCM_1600
share of cars with 1500ccm to 1599ccm within the PLZ8
Null percentage: 0.11871354018812394
3.0    364222
2.0    167171
4.0    163429
5.0     51951
1.0     38648
Name: KBA13_CCM_1600, dtype: int64

KBA13_CCM_1500
share of cars with 1400ccm to 1499ccm within the PLZ8
Null percentage: 0.11871354018812394
1.0    287731
4.0    206213
3.0    156747
5.0     68326
2.0     66404
Name: KBA13_CCM_1500, dtype: int64

KBA13_CCM_0_1400
share of cars with less than 1401ccm within the PLZ8
Null percentage: 0.11871354018812394
3.0    268319
2.0    178378
0.0    138711
4.0     81885
1.0     60025
5.0     58103
Name: KBA13_CCM_0_1400, dtype: int64

KBA13_CCM_1400
share of cars with 1200ccm to 1399ccm within the PLZ8
Null percentage: 0.11871354018812394
3.0    362764
2.0    169640
4.0    161205
5.0     49030
1.0     42782
Name: KBA13_CCM_1400, dtype: int64

KBA13_CCM_1200
share of cars with less than 1000ccm within the PLZ8
Null percentage: 0.11871354018812394
3.0    278653
2.0    161690
0.0    145802
4.0     81631
1.0     61072
5.0     56573
Name: KBA13_CCM_1200, dtype: int64

KBA13_CCM_1000
share of cars with less than 1000ccm within the PLZ8
Null percentage: 0.11871354018812394
3.0    290316
1.0    120040
2.0    119210
0.0    103227
4.0     86726
5.0     65902
Name: KBA13_CCM_1000, dtype: int64

KBA13_BMW
share of BMW within the PLZ8
Null percentage: 0.11871354018812394
3.0    346193
4.0    176871
2.0    139903
5.0     83249
1.0     39205
Name: KBA13_BMW, dtype: int64

KBA13_BJ_2009
share of cars built in 2009 within the PLZ8
Null percentage: 0.11871354018812394
3.0    286198
1.0    119909
2.0    118034
0.0    101115
4.0     88293
5.0     71872
Name: KBA13_BJ_2009, dtype: int64

KBA13_BJ_2008
share of cars built in 2008 within the PLZ8
Null percentage: 0.11871354018812394
3.0    274885
2.0    170564
0.0    134372
4.0     86070
5.0     69750
1.0     49780
Name: KBA13_BJ_2008, dtype: int64

KBA13_BJ_2006
share of cars built between 2005 and 2006 within the PLZ8
Null percentage: 0.11871354018812394
3.0    356794
2.0    166960
4.0    160793
1.0     52194
5.0     48680
Name: KBA13_BJ_2006, dtype: int64

KBA13_BJ_2004
share of cars built before 2004 within the PLZ8
Null percentage: 0.11871354018812394
3.0    364930
2.0    166228
4.0    157705
1.0     49522
5.0     47036
Name: KBA13_BJ_2004, dtype: int64

KBA13_BJ_2000
share of cars built between 2000 and 2003 within the PLZ8
Null percentage: 0.11871354018812394
3.0    347385
2.0    164179
4.0    159567
1.0     57775
5.0     56515
Name: KBA13_BJ_2000, dtype: int64

KBA13_BJ_1999
share of cars built between 1995 and 1999 within the PLZ8
Null percentage: 0.11871354018812394
3.0    359190
2.0    166296
4.0    159513
1.0     51058
5.0     49364
Name: KBA13_BJ_1999, dtype: int64

KBA13_AUTOQUOTE
share of cars per household within the PLZ8
Null percentage: 0.11871354018812394
3.0    322478
2.0    186121
4.0    130528
1.0    102004
5.0     44288
0.0         2
Name: KBA13_AUTOQUOTE, dtype: int64

KBA13_AUDI
share of AUDI within the PLZ8
Null percentage: 0.11871354018812394
3.0    346606
2.0    162879
4.0    158506
5.0     60877
1.0     56553
Name: KBA13_AUDI, dtype: int64

KBA13_ANZAHL_PKW
number of cars in the PLZ8
Null percentage: 0.11871354018812394
1400.0    11722
1500.0     8291
1300.0     6427
1600.0     6135
1700.0     3795
1800.0     2617
464.0      1604
417.0      1604
519.0      1600
534.0      1496
386.0      1458
1900.0     1450
395.0      1446
481.0      1417
455.0      1409
483.0      1393
452.0      1388
418.0      1384
454.0      1380
450.0      1380
494.0      1379
459.0      1379
492.0      1359
504.0      1340
387.0      1338
420.0      1337
439.0      1327
506.0      1326
388.0      1324
456.0      1323
          ...  
28.0         24
27.0         24
25.0         23
24.0         22
26.0         21
18.0         21
17.0         20
20.0         18
21.0         17
22.0         16
12.0         16
14.0         16
29.0         15
15.0         14
23.0         13
30.0         12
16.0         11
19.0         11
13.0         10
1.0           8
10.0          8
11.0          7
5.0           7
9.0           7
4.0           7
3.0           6
8.0           6
2.0           6
7.0           5
6.0           5
Name: KBA13_ANZAHL_PKW, Length: 1261, dtype: int64

KBA13_ALTERHALTER_61
share of car owners elder than 60 within the PLZ8
Null percentage: 0.11871354018812394
3.0    323096
4.0    177428
2.0    138065
5.0     87118
1.0     59714
Name: KBA13_ALTERHALTER_61, dtype: int64

KBA13_ALTERHALTER_60
share of car owners between 46 and 60 within the PLZ8
Null percentage: 0.11871354018812394
3.0    321522
2.0    188053
4.0    130673
1.0     93825
5.0     51348
Name: KBA13_ALTERHALTER_60, dtype: int64

KBA13_ALTERHALTER_45
share of car owners between 31 and 45 within the PLZ8
Null percentage: 0.11871354018812394
3.0    305775
4.0    161597
2.0    150705
5.0     97478
1.0     69866
Name: KBA13_ALTERHALTER_45, dtype: int64

KBA13_FAB_ASIEN
share of other Asian Manufacturers within the PLZ8
Null percentage: 0.11871354018812394
3.0    340870
2.0    169019
4.0    152278
1.0     62422
5.0     60832
Name: KBA13_FAB_ASIEN, dtype: int64

KBA13_FAB_SONSTIGE
share of other Manufacturers within the PLZ8
Null percentage: 0.11871354018812394
3.0    345124
2.0    167481
4.0    153466
5.0     61004
1.0     58346
Name: KBA13_FAB_SONSTIGE, dtype: int64

KBA13_CCM_2501
share of cars with more than 2501ccm within the PLZ8
Null percentage: 0.11871354018812394
3.0    294765
1.0    121797
2.0    106344
0.0     93403
4.0     91223
5.0     77889
Name: KBA13_CCM_2501, dtype: int64

KBA13_FORD
share of FORD within the PLZ8
Null percentage: 0.11871354018812394
3.0    335170
2.0    162145
4.0    154870
5.0     69233
1.0     64003
Name: KBA13_FORD, dtype: int64

KBA13_KMH_211
share of cars with a greater max speed than 210 km/h within the PLZ8
Null percentage: 0.11871354018812394
3.0    277452
2.0    162264
0.0    139463
4.0     88043
5.0     75823
1.0     42376
Name: KBA13_KMH_211, dtype: int64

KBA13_KMH_140_210
share of cars with max speed between 140 and 210 km/h within the PLZ8
Null percentage: 0.11871354018812394
3.0    361405
2.0    179354
4.0    133192
1.0     73767
5.0     37703
Name: KBA13_KMH_140_210, dtype: int64

KBA13_KMH_0_140
share of cars with max speed 140 km/h within the PLZ8
Null percentage: 0.11871354018812394
3.0    283566
1.0    234424
0.0     96010
4.0     92670
5.0     72077
2.0      6674
Name: KBA13_KMH_0_140, dtype: int64

KBA13_KMH_180
share of cars with max speed between 110 km/h and 180km/h within the PLZ8
Null percentage: 0.11871354018812394
3.0    355363
2.0    170291
4.0    155595
1.0     61574
5.0     42598
Name: KBA13_KMH_180, dtype: int64

KBA13_KMH_140
share of cars with max speed between 110 km/h and 140km/h within the PLZ8
Null percentage: 0.11871354018812394
1.0    249773
4.0    202067
3.0    167221
2.0     91648
5.0     74712
Name: KBA13_KMH_140, dtype: int64

KBA13_KMH_110
share of cars with max speed 110 km/h within the PLZ8
Null percentage: 0.11871354018812394
1.0    627623
3.0     94175
2.0     63623
Name: KBA13_KMH_110, dtype: int64

KBA13_HERST_SONST
share of other cars within the PLZ8
Null percentage: 0.11871354018812394
3.0    345124
2.0    167481
4.0    153466
5.0     61004
1.0     58346
Name: KBA13_HERST_SONST, dtype: int64

KBA13_HERST_FORD_OPEL
share of Ford & Opel/Vauxhall within the PLZ8
Null percentage: 0.11871354018812394
3.0    326805
2.0    164003
4.0    154044
1.0     74276
5.0     66293
Name: KBA13_HERST_FORD_OPEL, dtype: int64

KBA13_FIAT
share of FIAT within the PLZ8
Null percentage: 0.11871354018812394
3.0    343347
4.0    174024
2.0    148334
5.0     78722
1.0     40994
Name: KBA13_FIAT, dtype: int64

KBA13_HERST_BMW_BENZ
share of BMW & Mercedes Benz within the PLZ8
Null percentage: 0.11871354018812394
3.0    339754
4.0    180052
2.0    133074
5.0     86958
1.0     45583
Name: KBA13_HERST_BMW_BENZ, dtype: int64

KBA13_HERST_AUDI_VW
share of Volkswagen & Audi within the PLZ8
Null percentage: 0.11871354018812394
3.0    336178
2.0    172160
4.0    149322
1.0     72901
5.0     54860
Name: KBA13_HERST_AUDI_VW, dtype: int64

KBA13_HERST_EUROPA
share of European cars within the PLZ8
Null percentage: 0.11871354018812394
3.0    341097
4.0    170642
2.0    151037
5.0     72872
1.0     49773
Name: KBA13_HERST_EUROPA, dtype: int64

KBA13_HALTER_66
share of car owners over 66 within the PLZ8
Null percentage: 0.11871354018812394
3.0    320451
4.0    175161
2.0    139386
5.0     86577
1.0     63846
Name: KBA13_HALTER_66, dtype: int64

KBA13_HALTER_20
share of car owners below 21 within the PLZ8
Null percentage: 0.11871354018812394
3.0    338233
2.0    184872
4.0    146416
1.0     66025
5.0     49875
Name: KBA13_HALTER_20, dtype: int64

KBA13_HERST_ASIEN
share of asian cars within the PLZ8
Null percentage: 0.11871354018812394
3.0    338074
2.0    162979
4.0    155084
5.0     67474
1.0     61810
Name: KBA13_HERST_ASIEN, dtype: int64

KBA13_HALTER_30
share of car owners between 26 and 30 within the PLZ8
Null percentage: 0.11871354018812394
3.0    322185
2.0    155663
4.0    150541
5.0     90957
1.0     66075
Name: KBA13_HALTER_30, dtype: int64

KBA13_HALTER_35
share of car owners between 31 and 35 within the PLZ8
Null percentage: 0.11871354018812394
3.0    309769
4.0    160396
2.0    151032
5.0    100377
1.0     63847
Name: KBA13_HALTER_35, dtype: int64

KBA13_HALTER_40
share of car owners between 36 and 40 within the PLZ8
Null percentage: 0.11871354018812394
3.0    313672
4.0    159800
2.0    151984
5.0     95692
1.0     64273
Name: KBA13_HALTER_40, dtype: int64

KBA13_HALTER_25
share of car owners between 21 and 25 within the PLZ8
Null percentage: 0.11871354018812394
3.0    341430
2.0    165111
4.0    144771
1.0     72751
5.0     61358
Name: KBA13_HALTER_25, dtype: int64

KBA13_HALTER_50
share of car owners between 46 and 50 within the PLZ8
Null percentage: 0.11871354018812394
3.0    325071
2.0    183530
4.0    133612
1.0     89591
5.0     53617
Name: KBA13_HALTER_50, dtype: int64

KBA13_HALTER_55
share of car owners between 51 and 55 within the PLZ8
Null percentage: 0.11871354018812394
3.0    319411
2.0    183685
4.0    135541
1.0     92743
5.0     54041
Name: KBA13_HALTER_55, dtype: int64

KBA13_HALTER_60
share of car owners between 56 and 60 within the PLZ8
Null percentage: 0.11871354018812394
3.0    321266
2.0    172974
4.0    140907
1.0     88762
5.0     61512
Name: KBA13_HALTER_60, dtype: int64

KBA13_HALTER_65
share of car owners between 61 and 65 within the PLZ8
Null percentage: 0.11871354018812394
3.0    331364
4.0    175040
2.0    140351
5.0     85579
1.0     53087
Name: KBA13_HALTER_65, dtype: int64

KBA13_HALTER_45
share of car owners between 41 and 45 within the PLZ8
Null percentage: 0.11871354018812394
3.0    318028
4.0    160040
2.0    158198
5.0     79488
1.0     69667
Name: KBA13_HALTER_45, dtype: int64

PLZ8_HHZ
number of households within the PLZ8
Null percentage: 0.13073637178657146
3.0    309146
4.0    211911
5.0    175813
2.0     66891
1.0     10945
Name: PLZ8_HHZ, dtype: int64

PLZ8_ANTG1
number of 1-2 family houses in the PLZ8
Null percentage: 0.13073637178657146
2.0    270590
3.0    222355
1.0    189247
4.0     87044
0.0      5470
Name: PLZ8_ANTG1, dtype: int64

PLZ8_ANTG2
number of 3-5 family houses in the PLZ8
Null percentage: 0.13073637178657146
3.0    307283
2.0    215767
4.0    191005
1.0     53213
0.0      7438
Name: PLZ8_ANTG2, dtype: int64

PLZ8_ANTG3
number of 6-10 family houses in the PLZ8
Null percentage: 0.13073637178657146
2.0    252994
1.0    237878
3.0    164040
0.0    119794
Name: PLZ8_ANTG3, dtype: int64

PLZ8_ANTG4
number of >10 family houses in the PLZ8
Null percentage: 0.13073637178657146
0.0    356389
1.0    294986
2.0    123331
Name: PLZ8_ANTG4, dtype: int64

PLZ8_BAUMAX
most common building-type within the PLZ8
Null percentage: 0.13073637178657146
1.0    499550
5.0     97333
2.0     70407
4.0     56684
3.0     50732
Name: PLZ8_BAUMAX, dtype: int64

PLZ8_GBZ
number of buildings within the PLZ8
Null percentage: 0.13073637178657146
3.0    288383
4.0    180252
5.0    153883
2.0    111588
1.0     40600
Name: PLZ8_GBZ, dtype: int64

By skimming through the results I can see that:

  1. All features are ordinal categorical except KBA13_ANZAHL_PKW
  2. 105 features have the same null percentage which is 11.87%, while the remaining 7 have 13.07%
  3. KBA13_ANZAHL_PKW is supposed to encode the number of cars in the PLZ8, but it has high values of peculiar number which are 1400, 1500, 1300, etc.

Visualizing the null percentages

In [48]:
plz8_feats = dias_atts[dias_atts["Information level"] == "PLZ8"]["Attribute"].unique()
plz8_azdias = azdias_new.loc[:, plz8_feats]
null_p = plz8_azdias.isna().sum()*100/plz8_azdias.shape[0]
null_p.value_counts().plot(kind="bar", title="Null Percentages in PLZ8 Features");
plt.tight_layout();
plt.savefig("null_plz8.png");

This rings the bell for data MCAR or MNAR, as there is a pattern that is most likely unrelated to observed data, but can or can't be related to unobserved data, which is why some persons just don't have PLZ8 data collected.

Should we remove these features?

To answer this question, we first need to know if there is a difference between people who don't have PLZ8 and those who do, as might introduce bias into the model if removed data for a specific group of people.

We can further investigate this difference using other features to test if the data was MNAR or MCAR.

Visualize KBA13_ANZAHL_PKW

In [49]:
tmp = azdias_new["KBA13_ANZAHL_PKW"]
tmp.plot(kind="hist", bins=50, title="KBA13_ANZAHL_PKW");
plt.tight_layout()
plt.savefig("KBA13_ANZAHL_PKW.png")

We can see that the bins starts getting less granular is we exceed 1200. My guess is that this data was spread between 1300 and the max values, but the granularity of these section was decreased, which explains why the distribution is right skewed but then we start seeing bumps near the end.

How should we deal with this?

We could leave it as is, as I don't think it would make much difference. Or we can follow the lead of the last section and reduce the granularity of the whole feature.

This could be tested while modelling to see if it would have any effect.

Microcell (RR3_ID) Features

In [50]:
explore_category("Microcell (RR3_ID)", azdias_new)
/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
Number of features in Microcell (RR3_ID): 54


KBA05_AUTOQUOT
share of cars per household
Null percentage: 0.14959701353536328
3.0    258013
4.0    194706
2.0    123320
1.0     84157
5.0     82910
9.0     14791
Name: KBA05_AUTOQUOT, dtype: int64

KBA05_GBZ
number of buildings in the microcell
Null percentage: 0.14959701353536328
3.0    197833
5.0    158971
4.0    155301
2.0    138528
1.0    107264
Name: KBA05_GBZ, dtype: int64

KBA05_MOTRAD
share of motorcycles per household
Null percentage: 0.164421619328988
1.0    392117
0.0    204268
2.0     74250
3.0     74050
Name: KBA05_MOTRAD, dtype: int64

KBA05_MAXVORB
most common preowner structure in the microcell
Null percentage: 0.16618773570191905
2.0    323300
3.0    240866
1.0    178945
Name: KBA05_MAXVORB, dtype: int64

KBA05_MOD1
share of upper class cars (in an AZ specific definition)
Null percentage: 0.16618773570191905
0.0    286087
2.0    180755
1.0    140854
3.0     87732
4.0     47683
Name: KBA05_MOD1, dtype: int64

KBA05_MOD2
share of middle class cars (in an AZ specific definition)
Null percentage: 0.16618773570191905
3.0    301207
2.0    160999
4.0    157456
1.0     65695
5.0     57754
Name: KBA05_MOD2, dtype: int64

KBA05_MOD3
share of Golf-class cars (in an AZ specific definition)
Null percentage: 0.16618773570191905
3.0    276748
2.0    170403
4.0    165736
1.0     67924
5.0     62300
Name: KBA05_MOD3, dtype: int64

KBA05_MOD4
share of small cars (in an AZ specific definition)
Null percentage: 0.16618773570191905
3.0    223139
2.0    160094
4.0    130801
1.0     97881
5.0     80606
0.0     50590
Name: KBA05_MOD4, dtype: int64

KBA05_MOD8
share of vans (in an AZ specific definition)
Null percentage: 0.16618773570191905
0.0    221889
1.0    217315
2.0    216657
3.0     87250
Name: KBA05_MOD8, dtype: int64

KBA05_MOTOR
most common engine size in the microcell
Null percentage: 0.16618773570191905
3.0    289858
2.0    222119
1.0    121085
4.0    110049
Name: KBA05_MOTOR, dtype: int64

KBA05_SEG1
share of very small cars (Ford Ka etc.) in the microcell
Null percentage: 0.16618773570191905
1.0    251176
0.0    246416
2.0    185910
3.0     59609
Name: KBA05_SEG1, dtype: int64

KBA05_SEG2
share of small and very small cars (Ford Fiesta, Ford Ka etc.) in the microcell
Null percentage: 0.16618773570191905
3.0    300423
4.0    164242
2.0    152469
1.0     69404
5.0     56573
Name: KBA05_SEG2, dtype: int64

KBA05_SEG3
share of lowe midclass cars (Ford Focus etc.) in the microcell
Null percentage: 0.16618773570191905
3.0    271266
2.0    184407
4.0    163976
1.0     62378
5.0     61084
Name: KBA05_SEG3, dtype: int64

KBA05_SEG4
share of middle class cars (Ford Mondeo etc.) in the microcell
Null percentage: 0.16618773570191905
3.0    322991
2.0    152522
4.0    143664
1.0     62174
5.0     61760
Name: KBA05_SEG4, dtype: int64

KBA05_SEG5
share of upper middle class cars and upper class cars (BMW5er, BMW7er etc.)
Null percentage: 0.16618773570191905
1.0    235093
2.0    183424
0.0    182816
3.0     91014
4.0     50764
Name: KBA05_SEG5, dtype: int64

KBA05_SEG6
share of upper class cars (BMW 7er etc.) in the microcell
Null percentage: 0.16618773570191905
0.0    654630
1.0     88481
Name: KBA05_SEG6, dtype: int64

KBA05_SEG7
share of all-terrain vehicles and MUVs in the microcell
Null percentage: 0.16618773570191905
0.0    368086
1.0    183860
2.0    141172
3.0     49993
Name: KBA05_SEG7, dtype: int64

KBA05_SEG8
share of roadster and convertables in the microcell
Null percentage: 0.16618773570191905
0.0    403849
1.0    173773
2.0    120236
3.0     45253
Name: KBA05_SEG8, dtype: int64

KBA05_SEG9
share of vans in the microcell
Null percentage: 0.16618773570191905
0.0    257693
1.0    240744
2.0    188097
3.0     56577
Name: KBA05_SEG9, dtype: int64

KBA05_SEG10
share of more specific cars (Vans, convertables, all-terrains, MUVs etc.)
Null percentage: 0.16618773570191905
2.0    267744
1.0    151524
3.0    148664
0.0    111769
4.0     63410
Name: KBA05_SEG10, dtype: int64

KBA05_VORB0
share of cars with no preowner
Null percentage: 0.16618773570191905
3.0    243780
2.0    173206
4.0    162160
1.0    107076
5.0     56889
Name: KBA05_VORB0, dtype: int64

KBA05_VORB1
share of cars with one or two preowner
Null percentage: 0.16618773570191905
3.0    310192
2.0    153319
4.0    148709
5.0     65624
1.0     65267
Name: KBA05_VORB1, dtype: int64

KBA05_VORB2
share of cars with more than two preowner
Null percentage: 0.16618773570191905
3.0    234755
2.0    160742
4.0    120239
5.0     88479
1.0     84539
0.0     54357
Name: KBA05_VORB2, dtype: int64

KBA05_ZUL1
share of cars built before 1994
Null percentage: 0.16618773570191905
3.0    299375
4.0    158290
2.0    156849
1.0     67661
5.0     60936
Name: KBA05_ZUL1, dtype: int64

KBA05_ZUL2
share of cars built between 1994 and 2000
Null percentage: 0.16618773570191905
3.0    288618
2.0    166431
4.0    159876
1.0     64734
5.0     63452
Name: KBA05_ZUL2, dtype: int64

KBA05_MAXSEG
most common car segment in the microcell
Null percentage: 0.16618773570191905
2.0    299180
1.0    202835
3.0    171954
4.0     69142
Name: KBA05_MAXSEG, dtype: int64

KBA05_MAXHERST
most common car manufacturer in the microcell
Null percentage: 0.16618773570191905
2.0    270729
3.0    209450
4.0    116436
1.0     81673
5.0     64823
Name: KBA05_MAXHERST, dtype: int64

KBA05_MAXBJ
most common age of the cars in the microcell
Null percentage: 0.16618773570191905
1.0    256917
4.0    187538
2.0    183360
3.0    115296
Name: KBA05_MAXBJ, dtype: int64

KBA05_MAXAH
most common age of car owners in the microcell
Null percentage: 0.16618773570191905
3.0    209157
5.0    195036
2.0    185708
4.0    102197
1.0     51013
Name: KBA05_MAXAH, dtype: int64

KBA05_CCM1
share of cars with less than 1399ccm
Null percentage: 0.16618773570191905
3.0    290001
2.0    170860
4.0    148741
1.0     67781
5.0     65728
Name: KBA05_CCM1, dtype: int64

KBA05_CCM2
share of cars with 1400ccm to 1799 ccm
Null percentage: 0.16618773570191905
3.0    301075
4.0    163350
2.0    157818
1.0     62138
5.0     58730
Name: KBA05_CCM2, dtype: int64

KBA05_CCM3
share of cars with 1800ccm to 2499 ccm
Null percentage: 0.16618773570191905
3.0    285942
4.0    166348
2.0    154038
5.0     70510
1.0     66273
Name: KBA05_CCM3, dtype: int64

KBA05_CCM4
share of cars with more than 2499ccm
Null percentage: 0.16618773570191905
0.0    274064
1.0    214682
2.0    128431
3.0     78631
4.0     47303
Name: KBA05_CCM4, dtype: int64

KBA05_DIESEL
share of cars with Diesel-engine in the microcell
Null percentage: 0.16618773570191905
2.0    294616
3.0    163675
1.0    155449
4.0     64771
0.0     64600
Name: KBA05_DIESEL, dtype: int64

KBA05_FRAU
share of female car owners
Null percentage: 0.16618773570191905
3.0    303220
2.0    153893
4.0    146721
5.0     70099
1.0     69178
Name: KBA05_FRAU, dtype: int64

KBA05_HERST1
share of top German manufacturer (Mercedes, BMW) 
Null percentage: 0.16618773570191905
2.0    225687
3.0    177138
1.0    118781
4.0     87517
0.0     75567
5.0     58421
Name: KBA05_HERST1, dtype: int64

KBA05_HERST2
share of Volkswagen-Cars (including Audi)
Null percentage: 0.16618773570191905
3.0    301932
2.0    172981
4.0    152039
5.0     60682
1.0     55477
Name: KBA05_HERST2, dtype: int64

KBA05_HERST3
share of Ford/Opel
Null percentage: 0.16618773570191905
3.0    298419
2.0    159475
4.0    147284
5.0     60821
1.0     60323
0.0     16789
Name: KBA05_HERST3, dtype: int64

KBA05_HERST4
share of European manufacturer (e.g. Fiat, Peugeot, Rover,...)
Null percentage: 0.16618773570191905
3.0    259533
2.0    164362
4.0    142648
1.0     75131
5.0     70698
0.0     30739
Name: KBA05_HERST4, dtype: int64

KBA05_ZUL3
share of cars built between 2001 and 2002
Null percentage: 0.16618773570191905
3.0    224580
2.0    160683
4.0    156551
1.0     73877
0.0     71287
5.0     56133
Name: KBA05_ZUL3, dtype: int64

KBA05_HERST5
share of asian manufacturer (e.g. Toyota, Kia,...)
Null percentage: 0.16618773570191905
3.0    242170
2.0    164321
4.0    159261
5.0     65111
1.0     64953
0.0     47295
Name: KBA05_HERST5, dtype: int64

KBA05_KRSHERST1
share of Mercedes/BMW (reffered to the county average)
Null percentage: 0.16618773570191905
3.0    299103
2.0    174764
4.0    161780
1.0     63300
5.0     44164
Name: KBA05_KRSHERST1, dtype: int64

KBA05_KRSHERST2
share of Volkswagen (reffered to the county average)
Null percentage: 0.16618773570191905
3.0    297896
2.0    160002
4.0    152437
1.0     71757
5.0     61019
Name: KBA05_KRSHERST2, dtype: int64

KBA05_KRSHERST3
share of Ford/Opel (reffered to the county average)
Null percentage: 0.16618773570191905
3.0    293298
2.0    154609
4.0    147717
5.0     82429
1.0     65058
Name: KBA05_KRSHERST3, dtype: int64

KBA05_KRSKLEIN
share of small cars (referred to the county average)
Null percentage: 0.16618773570191905
2.0    436383
1.0    156643
3.0    150085
Name: KBA05_KRSKLEIN, dtype: int64

KBA05_KRSOBER
share of upper class cars (referred to the county average)
Null percentage: 0.16618773570191905
2.0    464492
1.0    152048
3.0    126571
Name: KBA05_KRSOBER, dtype: int64

KBA05_KRSVAN
share of vans (referred to the county average)
Null percentage: 0.16618773570191905
2.0    492053
1.0    125937
3.0    125121
Name: KBA05_KRSVAN, dtype: int64

KBA05_KRSZUL
share of newbuilt cars (referred to the county average)
Null percentage: 0.16618773570191905
2.0    380095
1.0    208542
3.0    154474
Name: KBA05_KRSZUL, dtype: int64

KBA05_KW1
share of cars with less than 59 KW engine power
Null percentage: 0.16618773570191905
3.0    274856
4.0    160522
2.0    160221
1.0     78285
5.0     69227
Name: KBA05_KW1, dtype: int64

KBA05_KW2
share of cars with an engine power between 60 and 119 KW
Null percentage: 0.16618773570191905
3.0    306246
2.0    155172
4.0    152081
5.0     64938
1.0     64674
Name: KBA05_KW2, dtype: int64

KBA05_KW3
share of cars with an engine power of more than 119 KW
Null percentage: 0.16618773570191905
1.0    233518
0.0    206843
2.0    160776
3.0     80358
4.0     61616
Name: KBA05_KW3, dtype: int64

KBA05_KRSAQUOT
share of cars per household (reffered to county average)
Null percentage: 0.16618773570191905
3.0    283526
2.0    151394
4.0    143668
1.0     83978
5.0     80545
Name: KBA05_KRSAQUOT, dtype: int64

KBA05_ZUL4
share of cars built from 2003 on
Null percentage: 0.16618773570191905
2.0    183127
1.0    174910
3.0    125299
0.0    105584
4.0    100351
5.0     53840
Name: KBA05_ZUL4, dtype: int64

KBA05_BAUMAX
most common building-type within the cell
Null percentage: 0.5346866826522265
1.0    208417
5.0     98923
3.0     59955
4.0     37718
2.0      9684
Name: KBA05_BAUMAX, dtype: int64

By skimming through the results I can see that:

  1. The most prominent null percentage is 16.6%, while a few have less and only has 53.46% which is KBA05_BAUMAX (most common building-type within the cell).

Why is this data missing?

This is data about a collective of individuals, and it should mean that we have any indicator about this collective, like an identifier for the microcell of this person, or the plz8.

I looked for any feature that has the postcode or the PLZ8 area or anything, but I didn't find anything. Therefore I think that in handling missing data we should do the following:

  1. First we should look for rows that have high percentage of missing values across all feature categories
  2. We should then drop these rows as we can't use other feature categories to infer them
  3. We shall then impute the missing values only when the rows are missing from a feature category, but not from the other. That's because the data is extremely related to each other (we have information about the person, their houshold, their community, their area, their microcell and their plz8, and all of these are related to each other, so we can use them to infer missing information about each other.)

Person Features

In [51]:
explore_category("Person", azdias_new)
/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
Number of features in Person: 42


ZABEOTYP
typification of energy consumers
Null percentage: 0.0
3    364905
4    210095
1    123622
5     84956
6     74473
2     33170
Name: ZABEOTYP, dtype: int64

SEMIO_REL
affinity indicating in what way the person is religious
Null percentage: 0.0
7    211377
4    207128
3    150801
1    108130
5     79566
2     73127
6     61092
Name: SEMIO_REL, dtype: int64

SEMIO_MAT
affinity indicating in what way the person is material minded
Null percentage: 0.0
5    171267
4    162862
2    134549
3    123701
7    111976
1     97341
6     89525
Name: SEMIO_MAT, dtype: int64

SEMIO_VERT
affinity indicating in what way the person is dreamily
Null percentage: 0.0
2    204333
6    141714
5    135205
7    134756
4    122982
1    120437
3     31794
Name: SEMIO_VERT, dtype: int64

SEMIO_LUST
affinity indicating in what way the person is sensual minded
Null percentage: 0.0
5    170040
6    158624
7    158234
2    114373
1    110382
4     97495
3     82073
Name: SEMIO_LUST, dtype: int64

SEMIO_ERL
affinity indicating in what way the person is eventful orientated
Null percentage: 0.0
4    196206
3    180824
7    179141
6    139209
2     77012
5     76133
1     42696
Name: SEMIO_ERL, dtype: int64

SEMIO_KULT
affinity indicating in what way the person is cultural minded
Null percentage: 0.0
3    209067
5    176282
1    128216
7    117378
4    101502
6    101286
2     57490
Name: SEMIO_KULT, dtype: int64

SEMIO_RAT
affinity indicating in what way the person is of a rational mind
Null percentage: 0.0
4    334456
2    140433
3    131994
5     89056
7     87024
6     61484
1     46774
Name: SEMIO_RAT, dtype: int64

SEMIO_KRIT
affinity indicating in what way the person is critical minded
Null percentage: 0.0
7    219847
5    156298
4    144079
6    133049
3    129106
1     54947
2     53895
Name: SEMIO_KRIT, dtype: int64

SEMIO_DOM
affinity indicating in what way the person is dominant minded
Null percentage: 0.0
6    183435
5    177889
7    161495
4    125115
2    101498
3     97027
1     44762
Name: SEMIO_DOM, dtype: int64

SEMIO_KAEM
affinity indicating in what way the person is of a fightfull attitude
Null percentage: 0.0
6    206001
3    180955
7    135579
5    128501
2    114038
4     78944
1     47203
Name: SEMIO_KAEM, dtype: int64

GREEN_AVANTGARDE
the environmental sustainability is the dominating movement in the youth of these consumers
Null percentage: 0.0
0    715996
1    175225
Name: GREEN_AVANTGARDE, dtype: int64

SEMIO_PFLICHT
affinity indicating in what way the person is dutyfull traditional minded
Null percentage: 0.0
5    203845
4    162117
3    133990
7    115458
6    109442
2     92214
1     74155
Name: SEMIO_PFLICHT, dtype: int64

GEBURTSJAHR
year of birth
Null percentage: 0.0
0       392318
1967     11183
1965     11090
1966     10933
1970     10883
1964     10799
1968     10792
1963     10513
1969     10360
1980     10275
1962     10082
1961      9880
1971      9786
1982      9516
1978      9509
1960      9492
1979      9422
1981      9374
1977      9296
1959      9098
1972      9027
1976      9005
1983      8887
1974      8676
1984      8553
1975      8480
1973      8356
1958      8323
1986      8192
1985      8180
         ...  
2009       559
2008       550
2010       545
2011       485
1923       468
2013       380
1922       375
1921       355
2015       257
1920       238
1919       194
2016       167
2014       124
1918        85
1914        55
1917        55
1916        45
1910        41
1913        39
1915        37
1911        30
1912        28
1905         8
1908         7
1906         7
1909         7
1904         5
1907         4
1900         4
1902         1
Name: GEBURTSJAHR, Length: 117, dtype: int64

FINANZTYP
best descirbing financial type for the peron
Null percentage: 0.0
6    290367
1    199572
4    130625
2    110867
5    106436
3     53354
Name: FINANZTYP, dtype: int64

FINANZ_HAUSBAUER
financial typology: main focus is the own house
Null percentage: 0.0
3    235184
5    183918
2    171847
4    157168
1    143104
Name: FINANZ_HAUSBAUER, dtype: int64

FINANZ_UNAUFFAELLIGER
financial typology: unremarkable
Null percentage: 0.0
1    220597
5    200551
2    185749
3    170628
4    113696
Name: FINANZ_UNAUFFAELLIGER, dtype: int64

FINANZ_ANLEGER
financial typology: investor
Null percentage: 0.0
5    234508
1    210812
2    161286
4    143597
3    141018
Name: FINANZ_ANLEGER, dtype: int64

FINANZ_VORSORGER
financial typology: be prepared
Null percentage: 0.0
5    242262
3    229842
4    198218
2    116530
1    104369
Name: FINANZ_VORSORGER, dtype: int64

FINANZ_SPARER
financial typology: money saver
Null percentage: 0.0
1    250213
4    201223
2    153051
5    146380
3    140354
Name: FINANZ_SPARER, dtype: int64

FINANZ_MINIMALIST
financial typology: low financial interest
Null percentage: 0.0
3    256276
5    168863
4    167182
2    159313
1    139587
Name: FINANZ_MINIMALIST, dtype: int64

SEMIO_TRADV
affinity indicating in what way the person is traditional minded
Null percentage: 0.0
3    226571
4    174203
2    132657
5    117378
1     96775
7     76133
6     67504
Name: SEMIO_TRADV, dtype: int64

ANREDE_KZ
gender
Null percentage: 0.0
2    465305
1    425916
Name: ANREDE_KZ, dtype: int64

ALTERSKATEGORIE_GROB
age through prename analysis 
Null percentage: 0.0
3    358533
4    228510
2    158410
1    142887
9      2881
Name: ALTERSKATEGORIE_GROB, dtype: int64

SEMIO_SOZ
affinity indicating in what way the person is social minded
Null percentage: 0.0
2    244714
6    136205
5    121786
3    118889
7    117378
4     90161
1     62088
Name: SEMIO_SOZ, dtype: int64

SEMIO_FAM
affinity indicating in what way the person is familiar minded
Null percentage: 0.0
6    186729
2    139562
4    135942
5    133740
7    118517
3     94815
1     81916
Name: SEMIO_FAM, dtype: int64

LP_STATUS_GROB
social status rough
Null percentage: 0.005446460529992
1.0    337511
2.0    226915
4.0    162946
5.0    118022
3.0     40973
Name: LP_STATUS_GROB, dtype: int64

LP_STATUS_FEIN
social status fine 
Null percentage: 0.005446460529992
1.0     219275
9.0     143238
2.0     118236
10.0    118022
4.0      78317
5.0      74493
3.0      74105
6.0      30914
8.0      19708
7.0      10059
Name: LP_STATUS_FEIN, dtype: int64

LP_FAMILIE_GROB
family type rough
Null percentage: 0.005446460529992
1.0    426379
5.0    200780
2.0    104305
0.0     72938
4.0     52784
3.0     29181
Name: LP_FAMILIE_GROB, dtype: int64

LP_FAMILIE_FEIN
family type fine
Null percentage: 0.005446460529992
1.0     426379
10.0    137913
2.0     104305
0.0      72938
11.0     51719
8.0      23032
7.0      20730
4.0      12303
5.0      11920
9.0      11148
6.0       9022
3.0       4958
Name: LP_FAMILIE_FEIN, dtype: int64

LP_LEBENSPHASE_GROB
lifestage rough
Null percentage: 0.005446460529992
2.0     158139
1.0     139681
3.0     115624
0.0      89718
12.0     74276
4.0      54443
5.0      49672
9.0      48938
10.0     41092
11.0     32819
8.0      30323
6.0      29181
7.0      22461
Name: LP_LEBENSPHASE_GROB, dtype: int64

LP_LEBENSPHASE_FEIN
lifestage fine
Null percentage: 0.005446460529992
0.0     92778
1.0     62667
5.0     55542
6.0     45614
2.0     39434
8.0     30475
11.0    26710
29.0    26577
7.0     26508
13.0    26085
10.0    25789
31.0    23987
12.0    23300
30.0    22361
15.0    20062
3.0     19985
19.0    19484
37.0    18525
4.0     17595
14.0    17529
20.0    17132
32.0    17105
39.0    16182
40.0    15150
27.0    14475
16.0    14466
38.0    13914
35.0    13679
34.0    13074
9.0     13066
21.0    12766
28.0    12264
24.0    12091
36.0    10505
25.0    10370
23.0     9191
22.0     7224
18.0     7168
33.0     6066
17.0     5888
26.0     3584
Name: LP_LEBENSPHASE_FEIN, dtype: int64

GFK_URLAUBERTYP
vacation habits
Null percentage: 0.005446460529992
12.0    138545
5.0     120126
10.0    109127
8.0      88042
11.0     79740
4.0      63770
9.0      60614
3.0      56007
1.0      53600
2.0      46702
7.0      42956
6.0      27138
Name: GFK_URLAUBERTYP, dtype: int64

CJT_GESAMTTYP
Customer-Journey-Typology relating to the preferred information and buying channels of consumers
Null percentage: 0.005446460529992
4.0    210963
3.0    156449
6.0    153915
2.0    148795
5.0    117376
1.0     98869
Name: CJT_GESAMTTYP, dtype: int64

RETOURTYP_BK_S
return type
Null percentage: 0.005446460529992
5.0    297993
3.0    231816
4.0    131115
1.0    129712
2.0     95731
Name: RETOURTYP_BK_S, dtype: int64

PRAEGENDE_JUGENDJAHRE
dominating movement in the person's youth (avantgarde or mainstream)
Null percentage: 0.12136608091595687
14.0    188697
8.0     145988
5.0      86416
10.0     85808
3.0      55195
15.0     42547
11.0     35752
9.0      33570
6.0      25652
12.0     24446
1.0      21282
4.0      20451
2.0       7479
13.0      5764
7.0       4010
Name: PRAEGENDE_JUGENDJAHRE, dtype: int64

NATIONALITAET_KZ
nationaltity
Null percentage: 0.12153551139391913
1.0    684085
2.0     65418
3.0     33403
Name: NATIONALITAET_KZ, dtype: int64

VERS_TYP
insurance typology 
Null percentage: 0.12476815514894735
2.0    398722
1.0    381303
Name: VERS_TYP, dtype: int64

HEALTH_TYP
health typology
Null percentage: 0.12476815514894735
3.0    310693
2.0    306944
1.0    162388
Name: HEALTH_TYP, dtype: int64

SHOPPER_TYP
shopping typology
Null percentage: 0.12476815514894735
1.0    254761
2.0    207463
3.0    190219
0.0    127582
Name: SHOPPER_TYP, dtype: int64

AGER_TYP
best-ager typology
Null percentage: 0.7601964047076988
2.0    98472
1.0    79802
3.0    27104
0.0     8340
Name: AGER_TYP, dtype: int64

TITEL_KZ
flag whether this person holds an academic title
Null percentage: 0.9975763587258379
1.0    1947
5.0     104
4.0      57
3.0      49
2.0       3
Name: TITEL_KZ, dtype: int64

Notes about the features:

  1. GEBURTSJAHR (year of birth) has 44% missing values (encoded as 0), which need to be dropped, as we can't infer year of birth, and I don't think it would be that indicative of anything when we have much more deep information about each individual.
  2. Some features have 0.5% missing values which are mostly related to social status, and it's peculiar why this data is missing about these individuals. If these individuals have more missing data in the other feature categories then we can safely drop them and not worry about their missing data type.
  3. PRAEGENDE_JUGENDJAHRE (dominating movement in the person's youth (avantgarde or mainstream)), NATIONALITAET_KZ (nationaltity), VERS_TYP (insurance typology), HEALTH_TYP (health typology) and SHOPPER_TYP (shopping typology) have around 12% missing values. We can only talk about imputing these values when we drop the rows with extreme missing values, and then we can asses how we can deal with them.
  4. AGER_TYP (best-ager typology) has 76% missing values, in addtion to around 1% of individuals that couldn't be classified (encode as 0).
  5. TITEL_KZ (flag whether this person holds an academic title) has 99% missing values, and it could be just that only 1% hold academic titles and hence the data is correct, and it could be that the data isn't complete and hence we should drop it.

First let's determine a threshold for the percentage of missing values in a feature to include this test, and that should be the highest repeating missing values percentage we have seen so far which is 16.6%.

In [52]:
# all features with missing percentage less than 17% and higher than 0%
features_missing = azdias.columns[((azdias_new.isnull().sum() / azdias.shape[0]) < 0.17) &
                                  ((azdias_new.isnull().sum() / azdias.shape[0])) > 0]

# narrow down to only features of the explored categories
plz8_feats = dias_atts[dias_atts["Information level"] == "PLZ8"]["Attribute"].unique()
rr3_feats = dias_atts[dias_atts["Information level"] == "Microcell (RR3_ID)"]["Attribute"].unique()
person_feats = dias_atts[dias_atts["Information level"] == "Person"]["Attribute"].unique()
features_missing = list(set(plz8_feats).union(rr3_feats).union(person_feats).intersection(features_missing))

# azdias with only features that have missing values
azdias_missing = azdias_new[features_missing]

# flag rows with all missing values
rows_all_missing = azdias_missing.isna().sum(axis=1) == azdias_missing.shape[1]

print("Number of rows with all missing values:", rows_all_missing.sum())
print("Percentage of rows with all missing values:", rows_all_missing.sum()/azdias_missing.shape[0])
Number of rows with all missing values: 7
Percentage of rows with all missing values: 7.854393018117841e-06

We can see that there are 7 rows that have all the values for the features explored missing, and we already know that there are rows with high percentage of missing features. But how many are there exactly?

In [53]:
rows_missing_p = azdias_missing.isnull().sum(axis=1)/azdias_missing.shape[1]
for i in np.arange(0, 0.8, 0.1):
    print("Percentage of rows with more than {:.2f}% values missing: {}".format(i*100, (rows_missing_p>i).sum()/azdias.shape[0]))
Percentage of rows with more than 0.00% values missing: 0.22381092905126787
Percentage of rows with more than 10.00% values missing: 0.1727225906929931
Percentage of rows with more than 20.00% values missing: 0.17272146863684765
Percentage of rows with more than 30.00% values missing: 0.1315577168850375
Percentage of rows with more than 40.00% values missing: 0.11871690635656026
Percentage of rows with more than 50.00% values missing: 0.11871354018812394
Percentage of rows with more than 60.00% values missing: 0.11799766836732976
Percentage of rows with more than 70.00% values missing: 0.11217980725319533

So let's look at this information using the whole dataset.

In [54]:
rows_missing_p = azdias_new.isnull().sum(axis=1)/azdias_new.shape[1]
for i in np.arange(0, 0.8, 0.1):
    print("Percentage of rows with more than {:.2f}% values missing: {}".format(i*100, (rows_missing_p>i).sum()/azdias.shape[0]))
Percentage of rows with more than 0.00% values missing: 1.0
Percentage of rows with more than 10.00% values missing: 0.84363474379531
Percentage of rows with more than 20.00% values missing: 0.17351027410709577
Percentage of rows with more than 30.00% values missing: 0.14224866783884133
Percentage of rows with more than 40.00% values missing: 0.11868997700906958
Percentage of rows with more than 50.00% values missing: 0.11244685661581134
Percentage of rows with more than 60.00% values missing: 0.11166141731399956
Percentage of rows with more than 70.00% values missing: 0.10016146387932959

What are the features that have the most missing values in rows with more than 50% missing data?

In [55]:
# calculate features missing values
feat_missing_count = azdias_new.isna().sum()

# filter out rows with more than 50% missing values
half_missing_rows = azdias_new[rows_missing_p > 0.5]

# transpose the dataframe to make the features as index
half_missing_rows_t = half_missing_rows.T

# calculate the percentage of null values in these rows for each feature
half_missing_rows_t["null_percentage"] = half_missing_rows_t.isna().sum(axis=1)/feat_missing_count

# sort values by null_percentage
half_missing_rows_t = half_missing_rows_t.sort_values("null_percentage", ascending=False)

# select features that have null values
half_missing_rows_t = half_missing_rows_t.query("null_percentage > 0")
In [56]:
# print each feature, category and percent of missing values
category_missing = defaultdict(list)
category_missing_p = defaultdict(list)
category_missing_count = defaultdict(int)
for feat, p in half_missing_rows_t["null_percentage"].iteritems():
    if feat in set(dias_atts.Attribute):
        category = dias_atts.query("Attribute == @feat")["Information level"].item()
    else:
        category = "Unknown" 
    category_missing_p[category].append(p)
    if p > 0.9:
        category_missing[category].append(feat)
        category_missing_count[category] += 1
    print(feat, category, p)
WOHNLAGE Building 1.0
SOHO_KZ Unknown 1.0
MOBI_RASTER Unknown 1.0
MIN_GEBAEUDEJAHR Building 1.0
DSL_FLAG Unknown 1.0
ANZ_TITEL Household 1.0
EINGEFUEGT_AM Unknown 1.0
ANZ_PERSONEN Household 1.0
EINGEZOGENAM_HH_JAHR Unknown 1.0
GEBAEUDETYP Building 1.0
ANZ_HAUSHALTE_AKTIV Building 1.0
UNGLEICHENN_FLAG Unknown 1.0
HH_EINKOMMEN_SCORE Household 1.0
KBA05_MODTEMP Building 1.0
AKT_DAT_KL Unknown 1.0
WOHNDAUER_2008 Household 1.0
OST_WEST_KZ Building 1.0
GEBAEUDETYP_RASTER RR1_ID 0.9999463260157802
KONSUMZELLE Unknown 0.9999463260157802
FIRMENDICHTE Unknown 0.9999463260157802
ANZ_STATISTISCHE_HAUSHALTE Unknown 0.9995707740017813
KONSUMNAEHE Building 0.9992158877367546
EWDICHTE Postcode  0.9938660123746533
INNENSTADT Postcode  0.9938660123746533
BALLRAUM Postcode  0.9938660123746533
VK_DISTANZ Unknown 0.9692822424489903
VK_ZG11 Unknown 0.9692822424489903
VK_DHT4A Unknown 0.9692822424489903
ANZ_HH_TITEL Building 0.9614877123536203
CAMEO_DEU_2015 Microcell (RR4_ID) 0.9596176966831348
CAMEO_INTL_2015 Unknown 0.9596176966831348
UMFELD_ALT Unknown 0.9592272922504244
UMFELD_JUNG Unknown 0.9592272922504244
ORTSGR_KLS9 Community 0.9584430546412114
ARBEIT Community 0.9584430546412114
RELAT_AB Community 0.9584430546412114
GEMEINDETYP Unknown 0.9578715792503649
STRUKTURTYP Unknown 0.9578715792503649
CAMEO_DEUG_2015 Microcell (RR4_ID) 0.9562263467267896
KBA13_HALTER_65 PLZ8 0.9472117202268431
KBA13_HALTER_60 PLZ8 0.9472117202268431
KBA13_HALTER_50 PLZ8 0.9472117202268431
KBA13_HALTER_45 PLZ8 0.9472117202268431
KBA13_HALTER_40 PLZ8 0.9472117202268431
KBA13_HALTER_35 PLZ8 0.9472117202268431
KBA13_HALTER_55 PLZ8 0.9472117202268431
KBA13_HERST_BMW_BENZ PLZ8 0.9472117202268431
KBA13_HALTER_66 PLZ8 0.9472117202268431
KBA13_HERST_ASIEN PLZ8 0.9472117202268431
KBA13_HERST_AUDI_VW PLZ8 0.9472117202268431
KBA13_HALTER_25 PLZ8 0.9472117202268431
KBA13_HERST_EUROPA PLZ8 0.9472117202268431
KBA13_HERST_FORD_OPEL PLZ8 0.9472117202268431
KBA13_HERST_SONST PLZ8 0.9472117202268431
KBA13_HHZ Unknown 0.9472117202268431
KBA13_KMH_0_140 PLZ8 0.9472117202268431
KBA13_KMH_110 PLZ8 0.9472117202268431
KBA13_KMH_140 PLZ8 0.9472117202268431
KBA13_KMH_140_210 PLZ8 0.9472117202268431
KBA13_HALTER_30 PLZ8 0.9472117202268431
KBA13_FIAT PLZ8 0.9472117202268431
KBA13_HALTER_20 PLZ8 0.9472117202268431
KBA13_CCM_1200 PLZ8 0.9472117202268431
KBA13_ALTERHALTER_61 PLZ8 0.9472117202268431
KBA13_AUDI PLZ8 0.9472117202268431
KBA13_AUTOQUOTE PLZ8 0.9472117202268431
KBA13_BAUMAX Unknown 0.9472117202268431
KBA13_BJ_1999 PLZ8 0.9472117202268431
KBA13_BJ_2000 PLZ8 0.9472117202268431
KBA13_BJ_2004 PLZ8 0.9472117202268431
KBA13_BJ_2006 PLZ8 0.9472117202268431
KBA13_BJ_2008 PLZ8 0.9472117202268431
KBA13_BJ_2009 PLZ8 0.9472117202268431
KBA13_BMW PLZ8 0.9472117202268431
KBA13_CCM_0_1400 PLZ8 0.9472117202268431
KBA13_CCM_1000 PLZ8 0.9472117202268431
KBA13_CCM_1400 PLZ8 0.9472117202268431
KBA13_GBZ Unknown 0.9472117202268431
KBA13_CCM_1401_2500 Unknown 0.9472117202268431
KBA13_CCM_1500 PLZ8 0.9472117202268431
KBA13_CCM_1600 PLZ8 0.9472117202268431
KBA13_CCM_1800 PLZ8 0.9472117202268431
KBA13_CCM_2000 PLZ8 0.9472117202268431
KBA13_CCM_2500 PLZ8 0.9472117202268431
KBA13_CCM_2501 PLZ8 0.9472117202268431
KBA13_CCM_3000 Unknown 0.9472117202268431
KBA13_CCM_3001 Unknown 0.9472117202268431
KBA13_FAB_ASIEN PLZ8 0.9472117202268431
KBA13_FAB_SONSTIGE PLZ8 0.9472117202268431
KBA13_KMH_210 Unknown 0.9472117202268431
KBA13_FORD PLZ8 0.9472117202268431
KBA13_KMH_180 PLZ8 0.9472117202268431
KBA13_KRSSEG_OBER PLZ8 0.9472117202268431
KBA13_KMH_211 PLZ8 0.9472117202268431
KBA13_KMH_250 PLZ8 0.9472117202268431
KBA13_SEG_GELAENDEWAGEN PLZ8 0.9472117202268431
KBA13_SEG_GROSSRAUMVANS PLZ8 0.9472117202268431
KBA13_SEG_KLEINST PLZ8 0.9472117202268431
KBA13_SEG_KLEINWAGEN PLZ8 0.9472117202268431
KBA13_SEG_KOMPAKTKLASSE PLZ8 0.9472117202268431
KBA13_SEG_MINIVANS PLZ8 0.9472117202268431
KBA13_SEG_MINIWAGEN PLZ8 0.9472117202268431
KBA13_SEG_MITTELKLASSE PLZ8 0.9472117202268431
KBA13_SEG_OBEREMITTELKLASSE PLZ8 0.9472117202268431
KBA13_SEG_OBERKLASSE PLZ8 0.9472117202268431
KBA13_SEG_SONSTIGE PLZ8 0.9472117202268431
KBA13_SEG_SPORTWAGEN PLZ8 0.9472117202268431
KBA13_SEG_UTILITIES PLZ8 0.9472117202268431
KBA13_SEG_VAN PLZ8 0.9472117202268431
KBA13_SEG_WOHNMOBILE PLZ8 0.9472117202268431
KBA13_SITZE_4 PLZ8 0.9472117202268431
KBA13_SITZE_5 PLZ8 0.9472117202268431
KBA13_SITZE_6 PLZ8 0.9472117202268431
KBA13_TOYOTA PLZ8 0.9472117202268431
KBA13_VORB_0 PLZ8 0.9472117202268431
KBA13_VORB_1 PLZ8 0.9472117202268431
KBA13_VORB_1_2 PLZ8 0.9472117202268431
KBA13_VORB_2 PLZ8 0.9472117202268431
KBA13_VORB_3 PLZ8 0.9472117202268431
KBA13_VW PLZ8 0.9472117202268431
KBA13_RENAULT PLZ8 0.9472117202268431
KBA13_PEUGEOT PLZ8 0.9472117202268431
KBA13_OPEL PLZ8 0.9472117202268431
KBA13_KW_120 PLZ8 0.9472117202268431
KBA13_KMH_251 PLZ8 0.9472117202268431
KBA13_KRSAQUOT PLZ8 0.9472117202268431
KBA13_KRSHERST_AUDI_VW PLZ8 0.9472117202268431
KBA13_KRSHERST_BMW_BENZ PLZ8 0.9472117202268431
KBA13_KRSHERST_FORD_OPEL PLZ8 0.9472117202268431
KBA13_KRSSEG_KLEIN PLZ8 0.9472117202268431
KBA13_KRSSEG_VAN PLZ8 0.9472117202268431
KBA13_KRSZUL_NEU PLZ8 0.9472117202268431
KBA13_KW_0_60 PLZ8 0.9472117202268431
KBA13_KW_110 PLZ8 0.9472117202268431
KBA13_ALTERHALTER_45 PLZ8 0.9472117202268431
KBA13_KW_121 PLZ8 0.9472117202268431
KBA13_NISSAN PLZ8 0.9472117202268431
KBA13_KW_30 PLZ8 0.9472117202268431
KBA13_KW_40 PLZ8 0.9472117202268431
KBA13_KW_50 PLZ8 0.9472117202268431
KBA13_KW_60 PLZ8 0.9472117202268431
KBA13_KW_61_120 PLZ8 0.9472117202268431
KBA13_KW_70 PLZ8 0.9472117202268431
KBA13_KW_80 PLZ8 0.9472117202268431
KBA13_KW_90 PLZ8 0.9472117202268431
KBA13_MAZDA PLZ8 0.9472117202268431
KBA13_MERCEDES PLZ8 0.9472117202268431
KBA13_MOTOR PLZ8 0.9472117202268431
KBA13_ALTERHALTER_60 PLZ8 0.9472117202268431
KBA13_ANZAHL_PKW PLZ8 0.9472117202268431
KBA13_ALTERHALTER_30 PLZ8 0.9472117202268431
KBA13_ANTG1 Unknown 0.880561999156474
KBA13_ANTG2 Unknown 0.8596832858662458
PLZ8_ANTG3 PLZ8 0.8586190619233575
PLZ8_HHZ PLZ8 0.8586190619233575
PLZ8_GBZ PLZ8 0.8586190619233575
PLZ8_BAUMAX PLZ8 0.8586190619233575
PLZ8_ANTG4 PLZ8 0.8586190619233575
PLZ8_ANTG2 PLZ8 0.8586190619233575
PLZ8_ANTG1 PLZ8 0.8586190619233575
KBA05_HERSTTEMP Building 0.8540945318392904
MOBI_REGIO RR1_ID 0.7443071014971048
KBA05_ANTG1 Microcell (RR4_ID) 0.7443071014971048
KBA05_ANTG2 Microcell (RR4_ID) 0.7443071014971048
KBA05_ANTG3 Microcell (RR4_ID) 0.7443071014971048
KBA05_ANTG4 Microcell (RR4_ID) 0.7443071014971048
KBA05_AUTOQUOT Microcell (RR3_ID) 0.7443071014971048
KBA05_GBZ Microcell (RR3_ID) 0.7443071014971048
PRAEGENDE_JUGENDJAHRE Person 0.743713250249621
HH_DELTA_FLAG Unknown 0.69360234939871
NATIONALITAET_KZ Person 0.6903475972856945
KBA05_MOTRAD Microcell (RR3_ID) 0.6818733962985205
KBA05_ANHANG Microcell (RR4_ID) 0.6790307131065686
KBA05_SEG3 Microcell (RR3_ID) 0.6750185672810749
KBA05_SEG8 Microcell (RR3_ID) 0.6750185672810749
KBA05_SEG7 Microcell (RR3_ID) 0.6750185672810749
KBA05_SEG2 Microcell (RR3_ID) 0.6750185672810749
KBA05_SEG5 Microcell (RR3_ID) 0.6750185672810749
KBA05_CCM2 Microcell (RR3_ID) 0.6750185672810749
KBA05_CCM1 Microcell (RR3_ID) 0.6750185672810749
KBA05_CCM3 Microcell (RR3_ID) 0.6750185672810749
KBA05_CCM4 Microcell (RR3_ID) 0.6750185672810749
KBA05_DIESEL Microcell (RR3_ID) 0.6750185672810749
KBA05_FRAU Microcell (RR3_ID) 0.6750185672810749
KBA05_SEG9 Microcell (RR3_ID) 0.6750185672810749
KBA05_ALTER3 Microcell (RR4_ID) 0.6750185672810749
KBA05_ALTER4 Microcell (RR4_ID) 0.6750185672810749
KBA05_HERST2 Microcell (RR3_ID) 0.6750185672810749
KBA05_ALTER2 Microcell (RR4_ID) 0.6750185672810749
KBA05_ALTER1 Microcell (RR4_ID) 0.6750185672810749
KBA05_ZUL4 Microcell (RR3_ID) 0.6750185672810749
KBA05_VORB0 Microcell (RR3_ID) 0.6750185672810749
KBA05_VORB1 Microcell (RR3_ID) 0.6750185672810749
KBA05_VORB2 Microcell (RR3_ID) 0.6750185672810749
KBA05_ZUL1 Microcell (RR3_ID) 0.6750185672810749
KBA05_ZUL2 Microcell (RR3_ID) 0.6750185672810749
KBA05_ZUL3 Microcell (RR3_ID) 0.6750185672810749
KBA05_HERST1 Microcell (RR3_ID) 0.6750185672810749
KBA05_SEG6 Microcell (RR3_ID) 0.6750185672810749
KBA05_HERST3 Microcell (RR3_ID) 0.6750185672810749
KBA05_KW2 Microcell (RR3_ID) 0.6750185672810749
KBA05_MAXVORB Microcell (RR3_ID) 0.6750185672810749
KBA05_MOD1 Microcell (RR3_ID) 0.6750185672810749
KBA05_MOD2 Microcell (RR3_ID) 0.6750185672810749
KBA05_HERST4 Microcell (RR3_ID) 0.6750185672810749
KBA05_MOD3 Microcell (RR3_ID) 0.6750185672810749
KBA05_MOD4 Microcell (RR3_ID) 0.6750185672810749
KBA05_MOD8 Microcell (RR3_ID) 0.6750185672810749
KBA05_MOTOR Microcell (RR3_ID) 0.6750185672810749
KBA05_SEG1 Microcell (RR3_ID) 0.6750185672810749
KBA05_MAXSEG Microcell (RR3_ID) 0.6750185672810749
KBA05_MAXHERST Microcell (RR3_ID) 0.6750185672810749
KBA05_MAXBJ Microcell (RR3_ID) 0.6750185672810749
KBA05_KW3 Microcell (RR3_ID) 0.6750185672810749
KBA05_MAXAH Microcell (RR3_ID) 0.6750185672810749
KBA05_KW1 Microcell (RR3_ID) 0.6750185672810749
KBA05_SEG10 Microcell (RR3_ID) 0.6750185672810749
KBA05_HERST5 Microcell (RR3_ID) 0.6750185672810749
KBA05_SEG4 Microcell (RR3_ID) 0.6750185672810749
KBA05_KRSAQUOT Microcell (RR3_ID) 0.6750185672810749
KBA05_KRSHERST1 Microcell (RR3_ID) 0.6750185672810749
KBA05_KRSHERST3 Microcell (RR3_ID) 0.6750185672810749
KBA05_KRSHERST2 Microcell (RR3_ID) 0.6750185672810749
KBA05_KRSKLEIN Microcell (RR3_ID) 0.6750185672810749
KBA05_KRSOBER Microcell (RR3_ID) 0.6750185672810749
KBA05_KRSVAN Microcell (RR3_ID) 0.6750185672810749
KBA05_KRSZUL Microcell (RR3_ID) 0.6750185672810749
HEALTH_TYP Person 0.6734324975718551
VERS_TYP Person 0.6734324975718551
SHOPPER_TYP Person 0.6734324975718551
RT_UEBERGROESSE Unknown 0.627224152453148
REGIOTYP RR1_ID 0.6120685292033606
KKK RR1_ID 0.6120685292033606
VHN Unknown 0.6120685292033606
W_KEIT_KIND_HH Household 0.6034002756980296
KBA13_ANTG3 Unknown 0.41757989916246513
D19_BANKEN_ONLINE_QUOTE_12 Household 0.31038881736823887
D19_VERSAND_ONLINE_QUOTE_12 Household 0.31038881736823887
D19_LETZTER_KAUF_BRANCHE Unknown 0.31038881736823887
D19_KONSUMTYP Household 0.31038881736823887
D19_GESAMT_ONLINE_QUOTE_12 Household 0.31038881736823887
ALTERSKATEGORIE_FEIN Unknown 0.2698867279333191
ALTER_HH Household 0.2653778842094067
KBA05_BAUMAX Microcell (RR3_ID) 0.20942491878688166
VERDICHTUNGSRAUM Unknown 0.20798788128465248
KBA13_ANTG4 Unknown 0.2059515613600633
D19_SONSTIGE Unknown 0.17493719772389926
KK_KUNDENTYP Unknown 0.1533547036324947
D19_VOLLSORTIMENT Unknown 0.15177949406835312
D19_BUCH_CD Unknown 0.14781273884532137
D19_TECHNIK Unknown 0.1467907525936318
EXTSEL992 Unknown 0.14351841235918814
D19_VERSICHERUNGEN Unknown 0.14117929197267606
AGER_TYP Person 0.1390237386402717
D19_BEKLEIDUNG_REST Unknown 0.13498011558089362
D19_HAUS_DEKO Unknown 0.13248913195905201
D19_TELKO_MOBILE Unknown 0.13068585203163438
VHA Unknown 0.1303870124457747
D19_REISEN Unknown 0.13035672606673143
D19_BANKEN_DIREKT Unknown 0.13011603831446011
D19_VERSAND_REST Unknown 0.12926139844943507
D19_LOTTO Unknown 0.12903299124547432
D19_KOSMETIK Unknown 0.12877093623799335
D19_KINDERARTIKEL Unknown 0.1274358957250472
D19_SOZIALES Unknown 0.1272929361510261
D19_HANDWERK Unknown 0.12597656631280577
D19_DROGERIEARTIKEL Unknown 0.12589912931956573
D19_TELKO_REST Unknown 0.1253843151129348
D19_SCHUHE Unknown 0.12413974210373804
D19_BANKEN_GROSS Unknown 0.12298959318826869
D19_FREIZEIT Unknown 0.12250173253678795
D19_SAMMELARTIKEL Unknown 0.12198956469700842
D19_RATGEBER Unknown 0.1211595002179932
ANZ_KINDER Unknown 0.12096811272198135
D19_BEKLEIDUNG_GEH Unknown 0.12049489437837944
D19_BILDUNG Unknown 0.1204184682889876
ALTER_KIND1 Unknown 0.1204165087766289
D19_BANKEN_REST Unknown 0.11909073208722741
D19_ENERGIE Unknown 0.11861320685371093
D19_WEIN_FEINKOST Unknown 0.11796441274329002
D19_LEBENSMITTEL Unknown 0.11764333809913667
D19_NAHRUNGSERGAENZUNG Unknown 0.11633042939486679
D19_GARTEN Unknown 0.11621768240988416
D19_TIERARTIKEL Unknown 0.11621412311374997
D19_BIO_OEKO Unknown 0.11614450270117109
D19_DIGIT_SERV Unknown 0.11559812093589425
ALTER_KIND2 Unknown 0.11524134233546318
D19_BANKEN_LOKAL Unknown 0.11385489485507205
ALTER_KIND3 Unknown 0.11303981352487032
TITEL_KZ Person 0.11264356438984502
D19_VERSI_ONLINE_QUOTE_12 Unknown 0.11259646460388388
ALTER_KIND4 Unknown 0.11255865063099989
D19_TELKO_ONLINE_QUOTE_12 Unknown 0.11252278385908877
LP_LEBENSPHASE_GROB Person 0.033992583436341164
GFK_URLAUBERTYP Person 0.033992583436341164
CJT_TYP_6 Unknown 0.033992583436341164
LP_FAMILIE_FEIN Person 0.033992583436341164
LP_FAMILIE_GROB Person 0.033992583436341164
LP_LEBENSPHASE_FEIN Person 0.033992583436341164
CJT_TYP_2 Unknown 0.033992583436341164
CJT_TYP_3 Unknown 0.033992583436341164
CJT_GESAMTTYP Person 0.033992583436341164
CJT_KATALOGNUTZER Unknown 0.033992583436341164
LP_STATUS_GROB Person 0.033992583436341164
CJT_TYP_5 Unknown 0.033992583436341164
RT_SCHNAEPPCHEN Unknown 0.033992583436341164
RT_KEIN_ANREIZ Unknown 0.033992583436341164
RETOURTYP_BK_S Person 0.033992583436341164
CJT_TYP_1 Unknown 0.033992583436341164
CJT_TYP_4 Unknown 0.033992583436341164
ONLINE_AFFINITAET RR1_ID 0.033992583436341164
LP_STATUS_FEIN Person 0.033992583436341164

This doesn't give us much information about which feature categories are most affected by these rows, except that that majority of these features have at least more than 60% missing values in these row.

In [221]:
plt.figure(figsize=(10, 4))
for category, p in category_missing_p.items():
    plt.hist(p, alpha=0.5, label=category)
plt.title("Distribution of Null Values Ratio between Data with >= 50% Missing Values and Full Data by Category")
plt.xlabel("Null Ratio")
plt.legend()
plt.tight_layout()
plt.savefig("null_ratio_1.png");
In [58]:
category_mean_p = pd.Series({category: np.mean(p) for category, p in category_missing_p.items()}).sort_values(ascending=False)
category_mean_p.plot(kind="bar", title="Mean Null Values Ratio between Data with >= 50% Missing Values and Full Data");
plt.xlabel("Categories")
plt.ylabel("Mean Null Ratio")
plt.tight_layout()
plt.savefig("null_ratio_2.png");

Using these plots we can see that most of the features can use dropping these rows to get rid of most of the missing values in the data, and the rest can be imputed using the imputer of our choice.

If we dropped these rows, how many features in each category would have more than 90% of their missing values removed?

In [59]:
# first we need to add the original count of Unknown features
original_category_count = dias_atts["Information level"].value_counts()
original_category_count["Unknown"] = len(azdias_feats.difference(dias_atts_feats))
original_category_count = original_category_count.sort_values(ascending=False)
In [225]:
category_half_missing_count = pd.Series(category_missing_count).sort_values(ascending=False)
category_half_missing_count[original_category_count.index].plot(kind="bar", alpha=0.5, color='blue', label="Missing rows")
original_category_count.plot(kind="bar", alpha=0.5, color='green', label="All", figsize=(9, 4))
plt.xticks(rotation=90);
plt.legend()
plt.title("Number of Features with More than 90% Missing Values in Rows with More than 50% Missing Values")
plt.tight_layout()
plt.savefig("drop_missing_rows.png")

This graph gives a much clearer picture of what is going on, so let me explain:

  1. First this graph has the count of features that have more than 90% missing values only in the rows with more than 50% missing values, so it's basically a win-win if we dropped these rows as it will automatically fix the problem of these features missing values, and the rest can be imputed.
  2. The majority of PLZ8, Building, Postcode and Community features are completely missing in rows with more than 50% missing values so if we drop these rows we won't need to impute them.
  3. Around 25% of the features with no known category won't need imputation if we dropped rows with more than 50% missing values.
  4. Some features from Household, RR4 and RR1 categories will be fixed.

Household Features

In [61]:
explore_category("Household", azdias_new)
/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
Number of features in Household: 25


D19_TELKO_ONLINE_DATUM
actuality of the last transaction for the segment telecommunication ONLINE
Null percentage: 0.0
10    883018
9       4664
8       1728
7        566
5        496
6        457
4        114
1         68
3         64
2         46
Name: D19_TELKO_ONLINE_DATUM, dtype: int64

D19_VERSI_DATUM
actuality of the last transaction for the segment insurance TOTAL
Null percentage: 0.0
10    660478
9      80231
8      36328
5      29040
6      23236
7      21016
2      15916
4       9593
1       8760
3       6623
Name: D19_VERSI_DATUM, dtype: int64

D19_VERSI_ONLINE_DATUM
actuality of the last transaction for the segment insurance ONLINE
Null percentage: 0.0
10    883826
9       2766
8       1254
7       1045
5        982
6        684
4        292
2        136
3        132
1        104
Name: D19_VERSI_ONLINE_DATUM, dtype: int64

D19_VERSI_OFFLINE_DATUM
actuality of the last transaction for the segment insurance OFFLINE
Null percentage: 0.0
10    858212
9      17992
8       7013
5       3368
7       1911
6       1586
4        544
2        207
3        205
1        183
Name: D19_VERSI_OFFLINE_DATUM, dtype: int64

D19_GESAMT_OFFLINE_DATUM
actuality of the last transaction with the complete file OFFLINE
Null percentage: 0.0
10    558558
9     147456
8      69382
5      34827
7      27784
6      22717
4       9547
2       9188
1       6311
3       5451
Name: D19_GESAMT_OFFLINE_DATUM, dtype: int64

D19_GESAMT_ONLINE_DATUM
actuality of the last transaction with the complete file ONLINE
Null percentage: 0.0
10    450995
9      86118
5      79773
1      57331
8      42628
2      41391
6      36995
4      36227
7      33434
3      26329
Name: D19_GESAMT_ONLINE_DATUM, dtype: int64

D19_GESAMT_DATUM
actuality of the last transaction with the complete file TOTAL
Null percentage: 0.0
10    354170
9      98281
5      93740
1      76009
2      59774
8      52852
4      44031
6      42008
7      37486
3      32870
Name: D19_GESAMT_DATUM, dtype: int64

D19_BANKEN_OFFLINE_DATUM
actuality of the last transaction for the segment banks OFFLINE
Null percentage: 0.0
10    871535
8       6451
9       5297
5       4177
2       2058
6        509
1        476
7        335
4        311
3         72
Name: D19_BANKEN_OFFLINE_DATUM, dtype: int64

D19_BANKEN_ONLINE_DATUM
actuality of the last transaction for the segment banks ONLINE
Null percentage: 0.0
10    726982
9      66077
8      22939
5      22124
7      16321
6      13668
1       6917
4       6869
2       4965
3       4359
Name: D19_BANKEN_ONLINE_DATUM, dtype: int64

D19_BANKEN_DATUM
actuality of the last transaction for the segment banks TOTAL
Null percentage: 0.0
10    678331
9      82707
8      33062
5      29494
7      20482
6      17152
1       8495
4       8406
2       8001
3       5091
Name: D19_BANKEN_DATUM, dtype: int64

D19_TELKO_OFFLINE_DATUM
actuality of the last transaction for the segment telecommunication OFFLINE
Null percentage: 0.0
10    819114
9      36707
8      18620
5       6309
6       3971
7       3590
4       1169
1        682
2        544
3        515
Name: D19_TELKO_OFFLINE_DATUM, dtype: int64

D19_VERSAND_DATUM
actuality of the last transaction for the segment mail-order TOTAL
Null percentage: 0.0
10    437886
9     100846
5      78589
1      53921
8      50332
2      40994
6      37125
4      34157
7      32104
3      25267
Name: D19_VERSAND_DATUM, dtype: int64

D19_TELKO_DATUM
actuality of the last transaction for the segment telecommunication TOTAL
Null percentage: 0.0
10    665798
9     117950
8      42460
5      19492
7      18163
6      13619
4       5314
1       3079
2       2818
3       2528
Name: D19_TELKO_DATUM, dtype: int64

D19_VERSAND_OFFLINE_DATUM
actuality of the last transaction for the segment mail-order OFFLINE
Null percentage: 0.0
10    634233
9     124063
8      57907
5      22691
7      19855
6      15543
4       5363
2       4845
1       3450
3       3271
Name: D19_VERSAND_OFFLINE_DATUM, dtype: int64

D19_VERSAND_ONLINE_DATUM
actuality of the last transaction for the segment mail-order ONLINE
Null percentage: 0.0
10    494464
9      82541
5      70781
1      49813
8      38330
2      37014
6      34143
4      31525
7      29454
3      23156
Name: D19_VERSAND_ONLINE_DATUM, dtype: int64

HH_EINKOMMEN_SCORE
estimated household_net_income 
Null percentage: 0.020587486156632306
6.0    252775
5.0    201482
2.0    140817
4.0    139762
3.0     84805
1.0     53232
Name: HH_EINKOMMEN_SCORE, dtype: int64

WOHNDAUER_2008
length of residenca
Null percentage: 0.08247000463409188
9.0    551176
8.0     80118
4.0     50736
3.0     38767
6.0     35170
5.0     30959
7.0     23939
2.0      6174
1.0       683
Name: WOHNDAUER_2008, dtype: int64

ANZ_TITEL
number of bearers of an academic title within this household
Null percentage: 0.08247000463409188
0.0    814542
1.0      2970
2.0       202
3.0         5
4.0         2
6.0         1
Name: ANZ_TITEL, dtype: int64

ANZ_PERSONEN
number of persons known in this household
Null percentage: 0.08247000463409188
1.0     423383
2.0     195470
3.0      94905
4.0      47126
0.0      34103
5.0      15503
6.0       4842
7.0       1525
8.0        523
9.0        180
10.0        67
11.0        38
12.0        16
13.0        11
21.0         4
14.0         4
20.0         3
15.0         3
38.0         2
23.0         2
37.0         2
22.0         2
35.0         1
17.0         1
16.0         1
45.0         1
18.0         1
40.0         1
29.0         1
31.0         1
Name: ANZ_PERSONEN, dtype: int64

W_KEIT_KIND_HH
likelihood of a child present in this household (can be specified in child age groups)
Null percentage: 0.16605084485217472
6.0    281966
4.0    128675
3.0    100170
2.0     84000
1.0     83706
5.0     64716
Name: W_KEIT_KIND_HH, dtype: int64

D19_KONSUMTYP
consumption type 
Null percentage: 0.2884952217239046
9.0    254296
1.0    117912
4.0     78262
6.0     56562
3.0     53330
2.0     49324
5.0     24422
Name: D19_KONSUMTYP, dtype: int64

D19_GESAMT_ONLINE_QUOTE_12
amount of online transactions within all transactions in the complete file 
Null percentage: 0.2884952217239046
0.0     393075
10.0    199906
5.0      10517
8.0       9467
7.0       6923
9.0       6046
3.0       3543
6.0       1679
2.0       1066
4.0       1017
1.0        869
Name: D19_GESAMT_ONLINE_QUOTE_12, dtype: int64

D19_BANKEN_ONLINE_QUOTE_12
amount of online transactions within all transactions in the segment bank 
Null percentage: 0.2884952217239046
0.0     588874
10.0     44065
5.0        391
3.0        220
7.0        214
8.0        172
9.0         67
6.0         50
2.0         35
4.0         18
1.0          2
Name: D19_BANKEN_ONLINE_QUOTE_12, dtype: int64

D19_VERSAND_ONLINE_QUOTE_12
amount of online transactions within all transactions in the segment mail-order 
Null percentage: 0.2884952217239046
0.0     417367
10.0    187652
5.0       8034
8.0       6419
7.0       4920
9.0       3931
3.0       2653
6.0       1080
2.0        751
4.0        739
1.0        562
Name: D19_VERSAND_ONLINE_QUOTE_12, dtype: int64

ALTER_HH
main age within the household
Null percentage: 0.34813699407890975
18.0    60852
17.0    55665
19.0    52890
15.0    51867
16.0    51857
14.0    44275
21.0    41610
20.0    40671
13.0    37612
12.0    34923
10.0    30419
11.0    27924
9.0     22817
8.0     13463
7.0      8419
6.0      3809
5.0      1030
4.0       603
3.0       200
2.0        47
1.0         1
Name: ALTER_HH, dtype: int64

Notes:

  1. ALTER_HH (main age within the household) has missing values encoded as 0.
  2. WOHNDAUER_2008 (length of residenca) won't need imputation as 100% of missing values are present in the rows that we are dropping.
  3. D19_GESAMT_ONLINE_QUOTE_12 and similar features encode wether a person has no transaction in the previous 12 months, but there are some null values of which only 30% are present in the rows we are going to drop.
  4. W_KEIT_KIND_HH (likelihood of a child present in this household (can be specified in child age groups)) has 8% missing values of which 60% are in rows that we are dropping.

Microcell (RR4_ID) Features

In [62]:
explore_category("Microcell (RR4_ID)", azdias_new)
/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
Number of features in Microcell (RR4_ID): 11


CAMEO_DEU_2015
CAMEO_4.0: specific group
Null percentage: 0.11105999522004081
6B    56672
8A    52438
4C    47819
2D    35074
3C    34769
7A    34399
3D    34307
8B    33434
4A    33155
8C    30993
9D    28593
9B    27676
9C    24987
7B    24503
9A    20542
2C    19422
8D    17576
6E    16107
2B    15486
5D    14943
6C    14820
2A    13249
5A    12214
1D    11909
1A    10850
3A    10543
5B    10354
5C     9935
7C     9065
4B     9047
4D     8570
3B     7160
6A     6810
9E     6379
6D     6073
6F     5392
7D     5333
4E     5321
1E     5065
7E     4633
1C     4317
5F     4283
1B     4071
5E     3581
XX      373
Name: CAMEO_DEU_2015, dtype: int64

CAMEO_DEUG_2015
CAMEO_4.0: uppergroup
Null percentage: 0.11147852216229195
8      78023
9      62578
6      61253
4      60185
8.0    56418
3      50360
2      48276
9.0    45599
7      45021
6.0    44621
4.0    43727
3.0    36419
2.0    34955
7.0    32912
5      32292
5.0    23018
1      20997
1.0    15215
Name: CAMEO_DEUG_2015, dtype: int64

KBA05_ANTG1
number of 1-2 family houses in the cell
Null percentage: 0.14959701353536328
0.0    261049
1.0    161224
2.0    126725
3.0    117762
4.0     91137
Name: KBA05_ANTG1, dtype: int64

KBA05_ANTG2
number of 3-5 family houses in the cell
Null percentage: 0.14959701353536328
0.0    292538
1.0    163751
2.0    138273
3.0    134455
4.0     28880
Name: KBA05_ANTG2, dtype: int64

KBA05_ANTG3
number of 6-10 family houses in the cell
Null percentage: 0.14959701353536328
0.0    511545
1.0     92748
2.0     80234
3.0     73370
Name: KBA05_ANTG3, dtype: int64

KBA05_ANTG4
number of >10 family houses in the cell
Null percentage: 0.14959701353536328
0.0    600171
1.0     83591
2.0     74135
Name: KBA05_ANTG4, dtype: int64

KBA05_ANHANG
share of trailers in the microcell
Null percentage: 0.16516778666570917
1.0    323472
0.0    266145
3.0     81525
2.0     72878
Name: KBA05_ANHANG, dtype: int64

KBA05_ALTER1
share of car owners less than 31 years old
Null percentage: 0.16618773570191905
2.0    228625
1.0    167046
3.0    166129
0.0    102789
4.0     78522
Name: KBA05_ALTER1, dtype: int64

KBA05_ALTER2
share of car owners inbetween 31 and 45 years of age
Null percentage: 0.16618773570191905
3.0    288107
2.0    165806
4.0    159928
5.0     72236
1.0     57034
Name: KBA05_ALTER2, dtype: int64

KBA05_ALTER3
share of car owners inbetween 45 and 60 years of age
Null percentage: 0.16618773570191905
3.0    292436
2.0    158737
4.0    156194
1.0     68157
5.0     67587
Name: KBA05_ALTER3, dtype: int64

KBA05_ALTER4
share of cars owners elder than 61 years
Null percentage: 0.16618773570191905
3.0    299085
4.0    144073
2.0    138597
1.0     56822
5.0     54407
0.0     50127
Name: KBA05_ALTER4, dtype: int64

Notes:

  1. CAMEO_DEU_2015 will need one hot encoding and has 11% missing values which should be imputed.
  2. CAMEO_DEUG_2015 is a less detailed CAMEO_DEU_2015 and also has 11% missing values that should be imputed.
  3. KBA05_ANTG1 (number of 1-2 family houses in the cell), KBA05_ANTG2, KBA05_ANTG3 and KBA05_ANTG4 have 15% missing values of which 74% are present in rows we are going to drop.
  4. KBA05_ANHANG (share of trailers in the microcell) has 16.5% of which 67% missing values in rows we are going to drop.
  5. KBA05_ALTER1, KBA05_ALTER2, KBA05_ALTER3 and KBA05_ALTER4 (share of car owners between X and Y years old) has 16.6% of which 67.4% missing values are in rows we are going to drop.

Building Features

In [63]:
explore_category("Building", azdias_new)
/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
Number of features in Building: 9


KONSUMNAEHE
distance from a building to PoS (Point of Sale)
Null percentage: 0.08299737102245122
1.0    193738
3.0    171127
5.0    153535
2.0    134665
4.0    133324
6.0     26625
7.0      4238
Name: KONSUMNAEHE, dtype: int64

ANZ_HAUSHALTE_AKTIV
number of households known in this building
Null percentage: 0.10451728583594866
1.0      195957
2.0      120982
3.0       62575
4.0       43213
5.0       37815
6.0       36020
7.0       34526
8.0       32293
9.0       29002
10.0      25428
11.0      21965
12.0      18033
13.0      15282
14.0      12625
15.0      10371
16.0       8899
17.0       7292
0.0        6463
18.0       6324
19.0       5461
20.0       4674
21.0       4138
22.0       3735
23.0       3243
24.0       2838
25.0       2636
26.0       2342
27.0       2232
28.0       2040
29.0       1963
          ...  
285.0         4
515.0         4
523.0         4
301.0         4
249.0         4
174.0         4
266.0         4
256.0         4
255.0         4
250.0         4
260.0         4
331.0         4
226.0         3
224.0         3
168.0         3
307.0         3
414.0         3
244.0         3
378.0         3
293.0         3
272.0         3
395.0         3
237.0         2
254.0         2
404.0         2
213.0         2
366.0         1
536.0         1
232.0         1
220.0         1
Name: ANZ_HAUSHALTE_AKTIV, Length: 292, dtype: int64

GEBAEUDETYP
type of building (residential or commercial)
Null percentage: 0.10451728583594866
1.0    460465
3.0    178668
8.0    152476
2.0      4935
4.0       900
6.0       628
5.0         1
Name: GEBAEUDETYP, dtype: int64

KBA05_MODTEMP
Development of the most common car segment in the neighbourhood
Null percentage: 0.10451728583594866
3.0    267178
4.0    226782
1.0    151667
2.0     77576
5.0     65321
6.0      9549
Name: KBA05_MODTEMP, dtype: int64

MIN_GEBAEUDEJAHR
year the building was first mentioned in our database
Null percentage: 0.10451728583594866
1992.0    568776
1994.0     78835
1993.0     25488
1995.0     25464
1996.0     16611
1997.0     14464
2000.0      7382
2001.0      5877
1991.0      5811
2005.0      5553
1999.0      4413
1990.0      4408
2002.0      4216
1998.0      4097
2003.0      3356
2004.0      2935
2008.0      2197
2007.0      2156
1989.0      2046
2009.0      2016
2006.0      1984
2011.0      1903
2012.0      1861
2010.0      1410
2013.0      1230
1988.0      1027
2014.0      1001
2015.0       717
1987.0       470
2016.0       128
1986.0       125
1985.0       116
Name: MIN_GEBAEUDEJAHR, dtype: int64

OST_WEST_KZ
flag indicating the former GDR/FRG
Null percentage: 0.10451728583594866
W    629528
O    168545
Name: OST_WEST_KZ, dtype: int64

WOHNLAGE
neighbourhood-area (very good -> rather poor; rural nbh)
Null percentage: 0.10451728583594866
3.0    249719
7.0    169318
4.0    135973
2.0    100376
5.0     74346
1.0     43918
8.0     17473
0.0      6950
Name: WOHNLAGE, dtype: int64

ANZ_HH_TITEL
number of holders of an academic title in the building
Null percentage: 0.10884842255736793
0.0     770244
1.0      20157
2.0       2459
3.0        585
4.0        232
5.0        117
6.0        106
8.0         68
7.0         65
9.0         34
13.0        29
12.0        22
11.0        22
14.0        16
10.0        16
17.0        13
20.0         9
15.0         7
18.0         6
16.0         3
23.0         3
Name: ANZ_HH_TITEL, dtype: int64

KBA05_HERSTTEMP
Development of the most common car manufacturers in the neighbourhood
Null percentage: 0.12346769207637612
3.0    275428
1.0    162386
2.0    157856
4.0    120193
5.0     65321
Name: KBA05_HERSTTEMP, dtype: int64

Notes:

  1. KONSUMNAEHE (distance from a building to PoS (Point of Sale)) has 8% missing values that can't be imputed, and fortunately 99% of these missing values are in the rows we are dropping.
  2. The rest of the features have 100% of it's missing values in the rows we are dropping, except KBA05_HERSTTEMP (Development of the most common car manufacturers in the neighbourhood) which has 85%.

RR1_ID Features

In [64]:
explore_category("RR1_ID", azdias_new)
Number of features in RR1_ID: 5


ONLINE_AFFINITAET
online affinity
Null percentage: 0.005446460529992
2.0    197850
4.0    164704
3.0    163487
1.0    156499
5.0    138111
0.0     65716
Name: ONLINE_AFFINITAET, dtype: int64

GEBAEUDETYP_RASTER
industrial areas
Null percentage: 0.10452514022896678
4.0    359620
3.0    205330
5.0    159217
2.0     58961
1.0     14938
Name: GEBAEUDETYP_RASTER, dtype: int64

MOBI_REGIO
moving patterns
Null percentage: 0.14959701353536328
1.0    163993
3.0    150336
5.0    148713
4.0    148209
2.0    146305
6.0       341
Name: MOBI_REGIO, dtype: int64

KKK
purchasing power
Null percentage: 0.17735668257368262
3.0    273024
2.0    181519
4.0    178648
1.0     99966
Name: KKK, dtype: int64

REGIOTYP
AZ neighbourhood typology
Null percentage: 0.17735668257368262
6.0    195286
5.0    145359
3.0     93929
2.0     91662
7.0     83943
4.0     68180
1.0     54798
Name: REGIOTYP, dtype: int64

/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  

Notes:

  1. ONLINE_AFFINITAET (online affinity) has only 0.5% missing values.
  2. GEBAEUDETYP_RASTER (industrial areas) has 10.4% missing values of which 99% exist in the rows we are dropping.
  3. MOBI_REGIO (moving patterns) has 15% missing values of which 74% exist in the rows we are dropping.
  4. KKK (purchasing power) and REGIOTYP (AZ neighbourhood typology) have 17.7% missing values of which 61% exist in the rows we are dropping.

Postcode Features

In [65]:
explore_category("Postcode ", azdias_new)
Number of features in Postcode : 3


BALLRAUM
distance to the next metropole
Null percentage: 0.10518154307405234
6.0    255093
1.0    151782
2.0    104521
7.0     99039
3.0     73277
4.0     61358
5.0     52411
Name: BALLRAUM, dtype: int64

EWDICHTE
density of inhabitants per square kilometer
Null percentage: 0.10518154307405234
6.0    201009
5.0    161209
2.0    139087
4.0    130716
1.0     84051
3.0     81409
Name: EWDICHTE, dtype: int64

INNENSTADT
distance to the city centre
Null percentage: 0.10518154307405234
5.0    147626
4.0    134067
6.0    111679
2.0    109048
3.0     92818
8.0     82870
7.0     67463
1.0     51910
Name: INNENSTADT, dtype: int64

/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  

Notes:

  1. All features have 10.5% missing values of which 99% exist in the rows that we are dropping.

Community Featuers

In [66]:
explore_category("Community", azdias_new)
/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
Number of features in Community: 3


ARBEIT
share of unemployed person in the community
Null percentage: 0.10908181023562057
4.0    311339
3.0    254988
2.0    135662
1.0     56767
5.0     35090
9.0       159
Name: ARBEIT, dtype: int64

ORTSGR_KLS9
classified number of inhabitants
Null percentage: 0.10908181023562057
5.0    148096
4.0    114909
7.0    102866
9.0     91879
3.0     83542
6.0     75995
8.0     72709
2.0     63362
1.0     40589
0.0        58
Name: ORTSGR_KLS9, dtype: int64

RELAT_AB
share of unemployed in relation to the county the community belongs to
Null percentage: 0.10908181023562057
3.0    274008
5.0    174964
1.0    142907
2.0    104846
4.0     97121
9.0       159
Name: RELAT_AB, dtype: int64

Notes:

  1. All features have 10.9% missing values of which 95.8% are in rows that we are dropping.

Unknown Features

Since we have no information about the feature description, let's look through them looking for what they mean and try to find which features are worth keeping and which are not.

In [67]:
# find features that don't have categories and transpose df so that features are index
no_cat_feats = azdias_new[list(set(azdias_feats).difference(dias_atts_feats))].T

# calculate null_percentage of features
no_cat_feats["null_percentage"] = no_cat_feats.isna().sum(axis=1)/no_cat_feats.shape[1]

# sort by null percentage
no_cat_feats.sort_values("null_percentage")

print(f"Number of features of unknown category: {len(no_cat_feats)}\n\n")
for feat, p in no_cat_feats["null_percentage"].iteritems():
    print(feat)
    print("Null percentage:", p)
    print(azdias_new[feat].value_counts())
    print()
Number of features of unknown category: 102


D19_ENERGIE
Null percentage: 0.9311461466908881
6.0    25788
3.0    14572
5.0     9556
7.0     7655
2.0     1967
4.0     1185
1.0      641
Name: D19_ENERGIE, dtype: int64

VHN
Null percentage: 0.17735668257368262
2.0    233844
3.0    179579
4.0    178413
1.0    141321
Name: VHN, dtype: int64

DSL_FLAG
Null percentage: 0.10451728583594866
1.0    772388
0.0     25685
Name: DSL_FLAG, dtype: int64

D19_BIO_OEKO
Null percentage: 0.9583189803651395
6.0    17732
7.0    15241
5.0     2121
3.0     1926
4.0       83
2.0       42
1.0        2
Name: D19_BIO_OEKO, dtype: int64

D19_LETZTER_KAUF_BRANCHE
Null percentage: 0.2884952217239046
D19_UNBEKANNT             195338
D19_VERSICHERUNGEN         57734
D19_SONSTIGE               44722
D19_VOLLSORTIMENT          34812
D19_SCHUHE                 32578
D19_BUCH_CD                28754
D19_VERSAND_REST           26034
D19_DROGERIEARTIKEL        24072
D19_BANKEN_DIREKT          23273
D19_BEKLEIDUNG_REST        21796
D19_HAUS_DEKO              20858
D19_TELKO_MOBILE           14447
D19_ENERGIE                12084
D19_TELKO_REST             11472
D19_BANKEN_GROSS           10550
D19_BEKLEIDUNG_GEH         10272
D19_KINDERARTIKEL           7301
D19_FREIZEIT                7257
D19_TECHNIK                 7002
D19_LEBENSMITTEL            6458
D19_BANKEN_REST             5247
D19_RATGEBER                4931
D19_NAHRUNGSERGAENZUNG      4061
D19_DIGIT_SERV              3577
D19_REISEN                  3122
D19_TIERARTIKEL             2578
D19_SAMMELARTIKEL           2443
D19_HANDWERK                2227
D19_WEIN_FEINKOST           2164
D19_GARTEN                  1646
D19_BANKEN_LOKAL            1442
D19_BIO_OEKO                1232
D19_BILDUNG                  980
D19_LOTTO                    839
D19_KOSMETIK                 805
Name: D19_LETZTER_KAUF_BRANCHE, dtype: int64

CJT_TYP_2
Null percentage: 0.005446460529992
5.0    233302
2.0    205425
3.0    173910
4.0    153045
1.0    120685
Name: CJT_TYP_2, dtype: int64

D19_BANKEN_GROSS
Null percentage: 0.8812079158816949
6.0    57103
3.0    14862
5.0    13911
4.0     8165
1.0     6347
2.0     5482
Name: D19_BANKEN_GROSS, dtype: int64

KBA13_ANTG2
Null percentage: 0.13080032898686184
3.0    325207
2.0    207993
4.0    182916
1.0     58533
Name: KBA13_ANTG2, dtype: int64

EINGEZOGENAM_HH_JAHR
Null percentage: 0.08247000463409188
1994.0    111439
1997.0     66259
2015.0     45926
2004.0     45103
2014.0     43992
2001.0     41440
2008.0     34959
2005.0     34714
2002.0     34547
2012.0     33580
2000.0     32267
1999.0     31991
2007.0     31105
2013.0     30982
1998.0     29590
2011.0     25973
2009.0     22886
1996.0     22039
2003.0     21716
2006.0     21378
2010.0     21240
1995.0     18195
2016.0     13699
1993.0       871
2018.0       565
1992.0       564
2017.0       208
1991.0       205
1990.0       158
1989.0        72
1988.0        28
1987.0        19
1986.0         8
1984.0         1
1971.0         1
1904.0         1
1900.0         1
Name: EINGEZOGENAM_HH_JAHR, dtype: int64

CJT_TYP_3
Null percentage: 0.005446460529992
5.0    270143
2.0    181803
3.0    170162
4.0    160469
1.0    103790
Name: CJT_TYP_3, dtype: int64

ALTER_KIND2
Null percentage: 0.9669004657655059
18.0    3128
14.0    3111
17.0    3085
15.0    3083
16.0    3010
13.0    2968
12.0    2628
11.0    2450
10.0    1953
9.0     1641
8.0     1179
7.0      627
6.0      396
5.0      154
4.0       67
3.0       15
2.0        4
Name: ALTER_KIND2, dtype: int64

UNGLEICHENN_FLAG
Null percentage: 0.08247000463409188
0.0    744072
1.0     73650
Name: UNGLEICHENN_FLAG, dtype: int64

D19_TELKO_MOBILE
Null percentage: 0.8155148947343027
6.0    116433
3.0     16429
5.0     14257
7.0      8222
4.0      3573
2.0      3558
1.0      1945
Name: D19_TELKO_MOBILE, dtype: int64

D19_FREIZEIT
Null percentage: 0.8872636528986637
6.0    55056
3.0    15297
5.0    13587
7.0     8514
4.0     3990
2.0     2676
1.0     1353
Name: D19_FREIZEIT, dtype: int64

KBA13_BAUMAX
Null percentage: 0.11871354018812394
1.0    491118
5.0    115476
2.0     69249
3.0     59060
4.0     50518
Name: KBA13_BAUMAX, dtype: int64

KBA13_ANTG1
Null percentage: 0.12769896580085074
2.0    299448
3.0    219174
1.0    200723
4.0     58068
Name: KBA13_ANTG1, dtype: int64

D19_VERSICHERUNGEN
Null percentage: 0.7345697644018712
6.0    116559
3.0     44933
5.0     30998
2.0     14892
4.0     13254
1.0     10107
7.0      5814
Name: D19_VERSICHERUNGEN, dtype: int64

D19_BEKLEIDUNG_GEH
Null percentage: 0.9080845267335487
6.0    39392
3.0    15239
5.0    11899
7.0     8478
2.0     3117
4.0     2013
1.0     1779
Name: D19_BEKLEIDUNG_GEH, dtype: int64

ALTER_KIND3
Null percentage: 0.9930769135826019
18.0    866
15.0    847
16.0    841
17.0    826
14.0    746
13.0    674
12.0    438
11.0    363
10.0    237
9.0     159
8.0     102
7.0      40
6.0      21
5.0       8
4.0       2
Name: ALTER_KIND3, dtype: int64

UMFELD_ALT
Null percentage: 0.10972138223852446
4.0    228222
3.0    208733
5.0    135160
2.0    121133
1.0    100187
Name: UMFELD_ALT, dtype: int64

D19_VERSI_ANZ_12
Null percentage: 0.0
0    821289
1     44933
2     20273
3      3335
4      1210
5       170
6        11
Name: D19_VERSI_ANZ_12, dtype: int64

D19_LOTTO
Null percentage: 0.839248626322764
7.0    113486
6.0     25736
5.0      2011
3.0      1869
4.0        78
2.0        66
1.0        19
Name: D19_LOTTO, dtype: int64

CJT_TYP_6
Null percentage: 0.005446460529992
5.0    272208
4.0    193430
2.0    175179
3.0    170731
1.0     74819
Name: CJT_TYP_6, dtype: int64

ANZ_STATISTISCHE_HAUSHALTE
Null percentage: 0.10456553425020282
1.0      219119
2.0      121485
3.0       61478
4.0       44864
5.0       40133
6.0       38466
7.0       36589
8.0       32816
9.0       28500
10.0      24100
11.0      19461
12.0      16296
13.0      13040
14.0      10616
15.0       8939
16.0       7314
17.0       6136
18.0       5220
19.0       4465
20.0       3840
21.0       3433
22.0       3221
23.0       2633
24.0       2485
25.0       2384
26.0       2042
27.0       1995
28.0       1930
29.0       1712
30.0       1661
          ...  
257.0         5
173.0         5
205.0         5
258.0         5
218.0         4
241.0         4
209.0         4
239.0         4
216.0         4
203.0         4
262.0         4
198.0         4
309.0         4
284.0         4
371.0         3
229.0         3
182.0         3
449.0         3
289.0         3
245.0         3
248.0         3
190.0         2
189.0         2
227.0         2
336.0         2
197.0         2
175.0         2
165.0         2
133.0         1
314.0         1
Name: ANZ_STATISTISCHE_HAUSHALTE, Length: 267, dtype: int64

ALTERSKATEGORIE_FEIN
Null percentage: 0.3412565457950385
15.0    63486
14.0    59709
16.0    53384
18.0    51365
17.0    50011
13.0    49556
12.0    42951
19.0    42340
10.0    34903
11.0    33061
20.0    27833
9.0     26204
8.0     14516
21.0    13658
7.0      8578
6.0      3754
22.0     3669
23.0     2838
24.0     2340
25.0     1017
5.0       994
4.0       636
3.0       218
2.0        64
1.0         1
Name: ALTERSKATEGORIE_FEIN, dtype: int64

D19_VERSAND_ANZ_24
Null percentage: 0.0
0    563818
2     93666
1     90253
4     55016
3     47832
5     30398
6     10238
Name: D19_VERSAND_ANZ_24, dtype: int64

LNR
Null percentage: 0.0
192418     1
611455     1
982154     1
976009     1
978056     1
955527     1
957574     1
951429     1
953476     1
963715     1
965762     1
959617     1
961664     1
613502     1
1021095    1
607357     1
609404     1
619643     1
621690     1
615545     1
617592     1
595063     1
597110     1
590965     1
593012     1
603251     1
980107     1
969868     1
967821     1
973966     1
          ..
626587     1
624538     1
581553     1
577459     1
735180     1
587700     1
724939     1
722890     1
729033     1
726984     1
749511     1
747462     1
753605     1
751556     1
741315     1
739266     1
745409     1
743360     1
569279     1
567230     1
573373     1
571324     1
561083     1
559034     1
565177     1
563128     1
585655     1
583606     1
589749     1
192606     1
Name: LNR, Length: 891221, dtype: int64

D19_TECHNIK
Null percentage: 0.7070086993012956
6.0    190979
7.0     41174
5.0     14389
3.0     11064
4.0      1846
2.0      1163
1.0       505
Name: D19_TECHNIK, dtype: int64

ANZ_KINDER
Null percentage: 0.9029645845418813
1.0     55350
2.0     24445
3.0      5376
4.0      1057
5.0       190
6.0        47
7.0        10
9.0         3
11.0        1
8.0         1
Name: ANZ_KINDER, dtype: int64

AKT_DAT_KL
Null percentage: 0.08247000463409188
1.0    390258
9.0    270663
5.0     29203
6.0     27655
3.0     24880
4.0     21466
7.0     21026
8.0     17485
2.0     15086
Name: AKT_DAT_KL, dtype: int64

D19_NAHRUNGSERGAENZUNG
Null percentage: 0.9561893178010842
6.0    15778
7.0    11596
3.0     5768
5.0     4243
2.0      679
1.0      572
4.0      409
Name: D19_NAHRUNGSERGAENZUNG, dtype: int64

D19_TELKO_ONLINE_QUOTE_12
Null percentage: 0.9991158197573891
10.0    767
5.0      19
7.0       1
3.0       1
Name: D19_TELKO_ONLINE_QUOTE_12, dtype: int64

CJT_KATALOGNUTZER
Null percentage: 0.005446460529992
5.0    281804
4.0    174275
1.0    167426
3.0    156998
2.0    105864
Name: CJT_KATALOGNUTZER, dtype: int64

HH_DELTA_FLAG
Null percentage: 0.12073548536221655
0.0    710942
1.0     72677
Name: HH_DELTA_FLAG, dtype: int64

FIRMENDICHTE
Null percentage: 0.10452514022896678
4.0    273637
3.0    181608
5.0    159217
2.0    139078
1.0     44526
Name: FIRMENDICHTE, dtype: int64

D19_BANKEN_ANZ_24
Null percentage: 0.0
0    794100
1     43554
2     29079
3     10214
4      9041
5      3930
6      1303
Name: D19_BANKEN_ANZ_24, dtype: int64

UMFELD_JUNG
Null percentage: 0.10972138223852446
5.0    350532
4.0    225939
3.0    130403
2.0     53460
1.0     33101
Name: UMFELD_JUNG, dtype: int64

D19_HAUS_DEKO
Null percentage: 0.8001382373171189
6.0    100720
3.0     31024
5.0     20748
7.0     10278
2.0      7476
1.0      4133
4.0      3742
Name: D19_HAUS_DEKO, dtype: int64

D19_KINDERARTIKEL
Null percentage: 0.8408296034316965
6.0    79320
7.0    22597
3.0    16402
5.0    13398
2.0     5842
4.0     3224
1.0     1073
Name: D19_KINDERARTIKEL, dtype: int64

D19_TIERARTIKEL
Null percentage: 0.9562386882714837
6.0    20168
7.0     7945
3.0     6030
5.0     3809
2.0      553
4.0      455
1.0       41
Name: D19_TIERARTIKEL, dtype: int64

D19_TELKO_REST
Null percentage: 0.8594647118952539
6.0    88346
5.0    15301
3.0    11467
7.0     5905
4.0     2402
2.0     1397
1.0      430
Name: D19_TELKO_REST, dtype: int64

KBA13_GBZ
Null percentage: 0.11871354018812394
3.0    284563
4.0    184003
5.0    167473
2.0    109422
1.0     39960
Name: KBA13_GBZ, dtype: int64

D19_GESAMT_ANZ_24
Null percentage: 0.0
0    505303
2    101785
1     86493
4     74210
3     58554
5     46547
6     18329
Name: D19_GESAMT_ANZ_24, dtype: int64

D19_TELKO_ANZ_24
Null percentage: 0.0
0    826208
1     46520
2     15343
3      2055
4       844
5       197
6        54
Name: D19_TELKO_ANZ_24, dtype: int64

KONSUMZELLE
Null percentage: 0.10452514022896678
0.0    609591
1.0    188475
Name: KONSUMZELLE, dtype: int64

KBA13_CCM_1401_2500
Null percentage: 0.11871354018812394
3.0    359093
2.0    174525
4.0    157962
1.0     60428
5.0     33413
Name: KBA13_CCM_1401_2500, dtype: int64

RT_KEIN_ANREIZ
Null percentage: 0.005446460529992
5.0    211534
4.0    206707
3.0    186655
1.0    141140
2.0    140331
Name: RT_KEIN_ANREIZ, dtype: int64

D19_VOLLSORTIMENT
Null percentage: 0.6732359313795344
6.0    173626
3.0     44589
5.0     31303
7.0     23231
2.0      8250
4.0      6366
1.0      3854
Name: D19_VOLLSORTIMENT, dtype: int64

EINGEFUEGT_AM
Null percentage: 0.10451728583594866
1992-02-10 00:00:00    383738
1992-02-12 00:00:00    192264
1995-02-07 00:00:00     11181
2005-12-16 00:00:00      6291
2003-11-18 00:00:00      6050
1993-03-01 00:00:00      3204
2005-04-15 00:00:00      2343
2000-05-10 00:00:00      2327
2004-04-14 00:00:00      2290
2005-08-23 00:00:00      2078
1992-02-21 00:00:00      2074
1994-02-03 00:00:00      2032
1995-10-17 00:00:00      2022
1995-10-10 00:00:00      1944
1993-09-21 00:00:00      1819
1993-09-22 00:00:00      1484
1993-04-01 00:00:00      1298
1993-04-02 00:00:00      1292
1994-12-13 00:00:00      1171
1996-03-07 00:00:00      1151
1995-10-18 00:00:00      1139
1995-07-19 00:00:00      1117
2006-01-19 00:00:00      1116
2005-04-12 00:00:00      1093
2003-11-17 00:00:00      1051
1993-10-21 00:00:00      1046
1996-05-09 00:00:00      1021
1995-08-02 00:00:00      1005
1993-09-23 00:00:00      1005
1995-08-15 00:00:00       988
                        ...  
2003-05-18 00:00:00         1
2015-01-14 00:00:00         1
2010-07-30 00:00:00         1
2014-10-27 00:00:00         1
2015-02-19 00:00:00         1
2005-08-15 00:00:00         1
2012-07-19 00:00:00         1
2014-07-25 00:00:00         1
2011-04-12 00:00:00         1
2015-06-08 00:00:00         1
2004-10-24 00:00:00         1
1993-05-10 00:00:00         1
2010-09-08 00:00:00         1
2010-04-27 00:00:00         1
2011-10-03 00:00:00         1
2005-05-02 00:00:00         1
2016-03-21 00:00:00         1
1995-01-27 00:00:00         1
2015-09-21 00:00:00         1
2008-12-28 00:00:00         1
2001-08-23 00:00:00         1
2011-02-07 00:00:00         1
2005-05-26 00:00:00         1
2013-07-24 00:00:00         1
2011-07-20 00:00:00         1
2013-09-26 00:00:00         1
2011-10-04 00:00:00         1
2014-09-11 00:00:00         1
2011-02-06 00:00:00         1
1998-12-03 00:00:00         1
Name: EINGEFUEGT_AM, Length: 5162, dtype: int64

D19_BANKEN_LOKAL
Null percentage: 0.9815130029476415
7.0    8522
3.0    3500
6.0    3202
5.0    1053
2.0     118
4.0      69
1.0      12
Name: D19_BANKEN_LOKAL, dtype: int64

D19_GARTEN
Null percentage: 0.9555721869210891
6.0    20410
7.0     9555
5.0     4979
3.0     4003
4.0      328
2.0      265
1.0       55
Name: D19_GARTEN, dtype: int64

RT_SCHNAEPPCHEN
Null percentage: 0.005446460529992
5.0    402504
4.0    182059
3.0    133538
2.0    115106
1.0     53160
Name: RT_SCHNAEPPCHEN, dtype: int64

D19_BILDUNG
Null percentage: 0.9124066870058044
6.0    37502
7.0    21828
2.0     8582
3.0     5127
5.0     3363
4.0     1288
1.0      375
Name: D19_BILDUNG, dtype: int64

D19_SONSTIGE
Null percentage: 0.5677076729565393
6.0    220478
7.0     58373
3.0     44578
5.0     35028
2.0     10816
4.0     10204
1.0      5791
Name: D19_SONSTIGE, dtype: int64

VK_DHT4A
Null percentage: 0.08518313639377888
10.0    114500
7.0      96693
9.0      88238
8.0      86090
3.0      82570
6.0      80078
2.0      74200
5.0      70671
4.0      70198
1.0      49535
11.0      2531
Name: VK_DHT4A, dtype: int64

D19_BEKLEIDUNG_REST
Null percentage: 0.7770261248332344
6.0    109025
3.0     30201
7.0     19637
5.0     18554
2.0      8748
1.0      7262
4.0      5292
Name: D19_BEKLEIDUNG_REST, dtype: int64

D19_RATGEBER
Null percentage: 0.9033348630698783
6.0    44707
7.0    12287
3.0    10598
2.0     8334
5.0     7197
4.0     2136
1.0      891
Name: D19_RATGEBER, dtype: int64

ALTER_KIND1
Null percentage: 0.9090483729624863
18.0    6703
17.0    6394
8.0     6343
7.0     6249
16.0    6124
15.0    6008
14.0    5992
9.0     5846
13.0    5713
10.0    5678
12.0    5576
11.0    5506
6.0     4875
5.0     1501
4.0     1084
3.0     1063
2.0      403
Name: ALTER_KIND1, dtype: int64

KBA13_ANTG3
Null percentage: 0.26928225434544295
2.0    251723
1.0    220761
3.0    178747
Name: KBA13_ANTG3, dtype: int64

D19_DIGIT_SERV
Null percentage: 0.9623437957588522
6.0    17942
3.0     6153
7.0     4030
5.0     3225
2.0     1293
4.0      465
1.0      452
Name: D19_DIGIT_SERV, dtype: int64

D19_WEIN_FEINKOST
Null percentage: 0.9381982695650125
6.0    27556
7.0    20665
3.0     3460
5.0     2952
4.0      231
2.0      179
1.0       36
Name: D19_WEIN_FEINKOST, dtype: int64

GEMEINDETYP
Null percentage: 0.1091468894920564
22.0    151307
11.0    150715
40.0    125571
30.0    122406
12.0    120145
21.0     72777
50.0     51026
Name: GEMEINDETYP, dtype: int64

CJT_TYP_4
Null percentage: 0.005446460529992
5.0    254763
3.0    180975
2.0    180312
4.0    169791
1.0    100526
Name: CJT_TYP_4, dtype: int64

VHA
Null percentage: 0.8292511060668454
1.0    81016
4.0    24469
5.0    22372
3.0    19445
2.0     4873
Name: VHA, dtype: int64

CJT_TYP_5
Null percentage: 0.005446460529992
5.0    271673
3.0    194636
2.0    174808
4.0    147220
1.0     98030
Name: CJT_TYP_5, dtype: int64

D19_BANKEN_DIREKT
Null percentage: 0.8177668614182116
6.0    84798
3.0    27350
5.0    18539
4.0     8771
2.0     8119
7.0     7656
1.0     7177
Name: D19_BANKEN_DIREKT, dtype: int64

D19_LEBENSMITTEL
Null percentage: 0.9401865530547417
6.0    27626
3.0    10044
7.0     8318
5.0     5198
2.0     1315
4.0      409
1.0      397
Name: D19_LEBENSMITTEL, dtype: int64

D19_SOZIALES
Null percentage: 0.8560626376622633
4.0    36514
5.0    30414
1.0    25128
3.0    21483
2.0    14741
Name: D19_SOZIALES, dtype: int64

VK_ZG11
Null percentage: 0.08518313639377888
10.0    97938
5.0     97777
6.0     88581
7.0     88552
4.0     86600
8.0     83994
9.0     82134
3.0     69634
2.0     59916
1.0     52009
11.0     8169
Name: VK_ZG11, dtype: int64

VK_DISTANZ
Null percentage: 0.08518313639377888
10.0    94320
8.0     92104
9.0     89067
7.0     84222
6.0     83291
11.0    79990
3.0     71410
12.0    59994
1.0     44858
5.0     38791
4.0     29786
13.0    27554
2.0     19917
Name: VK_DISTANZ, dtype: int64

D19_BANKEN_ANZ_12
Null percentage: 0.0
0    831734
1     29771
2     18067
3      5708
4      4082
5      1483
6       376
Name: D19_BANKEN_ANZ_12, dtype: int64

VERDICHTUNGSRAUM
Null percentage: 0.5229409989217041
1.0     111235
2.0      47613
3.0      29827
4.0      26996
5.0      24019
6.0      21882
7.0      13238
8.0      11864
10.0     11034
9.0       9425
13.0      8707
11.0      8226
14.0      8180
12.0      8046
15.0      6942
16.0      6435
17.0      5502
18.0      5061
20.0      3538
22.0      3492
21.0      3364
19.0      3300
23.0      3239
24.0      2980
25.0      2887
30.0      2648
27.0      2622
26.0      2569
29.0      2552
28.0      2454
32.0      2390
31.0      2313
33.0      2240
34.0      2054
36.0      1959
35.0      1769
39.0      1660
38.0      1622
44.0      1435
40.0      1359
37.0      1348
41.0      1329
42.0      1324
43.0      1321
45.0      1165
Name: VERDICHTUNGSRAUM, dtype: int64

D19_VERSI_ONLINE_QUOTE_12
Null percentage: 0.9981530955845969
10.0    1548
5.0       70
7.0       11
3.0        9
8.0        6
6.0        1
9.0        1
Name: D19_VERSI_ONLINE_QUOTE_12, dtype: int64

KBA13_ANTG4
Null percentage: 0.5459869100930073
1.0    277982
2.0    126644
Name: KBA13_ANTG4, dtype: int64

D19_TELKO_ANZ_12
Null percentage: 0.0
0    857990
1     24868
2      6954
3       865
4       406
5       103
6        35
Name: D19_TELKO_ANZ_12, dtype: int64

D19_BANKEN_REST
Null percentage: 0.9220608580812166
6.0    43143
5.0     7744
7.0     7339
3.0     5943
2.0     2928
4.0     1448
1.0      916
Name: D19_BANKEN_REST, dtype: int64

KK_KUNDENTYP
Null percentage: 0.6559674873011295
3.0    65151
2.0    62564
5.0    48038
4.0    44512
6.0    44114
1.0    42230
Name: KK_KUNDENTYP, dtype: int64

D19_SAMMELARTIKEL
Null percentage: 0.8999844034195783
6.0    71297
7.0    10067
5.0     4072
3.0     3016
4.0      470
2.0      172
1.0       42
Name: D19_SAMMELARTIKEL, dtype: int64

KBA13_HHZ
Null percentage: 0.11871354018812394
3.0    319964
4.0    212119
5.0    168143
2.0     72085
1.0     13110
Name: KBA13_HHZ, dtype: int64

STRUKTURTYP
Null percentage: 0.1091468894920564
3.0    555713
1.0    127607
2.0    110627
Name: STRUKTURTYP, dtype: int64

D19_VERSAND_REST
Null percentage: 0.8240851595732147
6.0    69248
3.0    39251
5.0    25227
2.0     7560
7.0     5770
4.0     4901
1.0     4822
Name: D19_VERSAND_REST, dtype: int64

KBA13_CCM_3000
Null percentage: 0.11871354018812394
3.0    308099
1.0    149271
2.0    103136
4.0     92197
5.0     76156
0.0     56562
Name: KBA13_CCM_3000, dtype: int64

RT_UEBERGROESSE
Null percentage: 0.08525831415552372
5.0    188774
4.0    168467
3.0    155865
2.0    152524
1.0    149607
Name: RT_UEBERGROESSE, dtype: int64

D19_VERSI_ANZ_24
Null percentage: 0.0
0    777037
1     63340
2     37144
3      8848
4      4048
5       707
6        97
Name: D19_VERSI_ANZ_24, dtype: int64

D19_REISEN
Null percentage: 0.8268701029262102
6.0    94123
7.0    45315
3.0     5758
5.0     5149
2.0     2975
4.0      890
1.0       87
Name: D19_REISEN, dtype: int64

MOBI_RASTER
Null percentage: 0.10451728583594866
1.0    355579
3.0    124055
2.0    118231
4.0     92804
5.0     81375
6.0     26029
Name: MOBI_RASTER, dtype: int64

CJT_TYP_1
Null percentage: 0.005446460529992
5.0    267488
2.0    196545
3.0    171838
4.0    162940
1.0     87556
Name: CJT_TYP_1, dtype: int64

CAMEO_INTL_2015
Null percentage: 0.11105999522004081
51      77576
51.0    56118
41      53459
24      52882
41.0    38877
24.0    38276
14      36524
43      32730
14.0    26360
54      26207
43.0    23942
25      22837
54.0    19184
22      19173
25.0    16791
23      15653
13      15272
45      15206
22.0    13982
55      13842
52      11836
23.0    11097
13.0    11064
31      11041
45.0    10926
34      10737
55.0    10113
15       9832
52.0     8706
44       8543
31.0     7983
34.0     7787
12       7645
15.0     7142
44.0     6277
35       6090
32       6067
33       5833
12.0     5604
32.0     4287
35.0     4266
33.0     4102
XX        373
Name: CAMEO_INTL_2015, dtype: int64

D19_HANDWERK
Null percentage: 0.8621666230934864
6.0    90968
7.0    25319
5.0     3810
3.0     2465
4.0      184
2.0       87
1.0        7
Name: D19_HANDWERK, dtype: int64

D19_SCHUHE
Null percentage: 0.8673763297767894
3.0    46487
6.0    20589
5.0    19947
2.0    17071
7.0     5632
1.0     4976
4.0     3495
Name: D19_SCHUHE, dtype: int64

KOMBIALTER
Null percentage: 0.0
4    272770
3    246214
2    183764
1     94779
9     93694
Name: KOMBIALTER, dtype: int64

KBA13_KMH_210
Null percentage: 0.11871354018812394
3.0    361259
4.0    164843
2.0    161113
5.0     55495
1.0     42711
Name: KBA13_KMH_210, dtype: int64

D19_GESAMT_ANZ_12
Null percentage: 0.0
0    584797
1     99465
2     97282
3     45685
4     43579
5     16966
6      3447
Name: D19_GESAMT_ANZ_12, dtype: int64

D19_BUCH_CD
Null percentage: 0.6988031027096534
6.0    188263
3.0     26192
5.0     15900
1.0     11735
2.0      9187
7.0      8853
4.0      8303
Name: D19_BUCH_CD, dtype: int64

D19_KONSUMTYP_MAX
Null percentage: 0.0
8    260285
9    257113
1    144570
2     91423
4     75752
3     62078
Name: D19_KONSUMTYP_MAX, dtype: int64

ALTER_KIND4
Null percentage: 0.9986479223447383
17.0    225
18.0    216
15.0    171
16.0    159
14.0    136
13.0    119
12.0     59
11.0     48
10.0     42
9.0      15
8.0      14
7.0       1
Name: ALTER_KIND4, dtype: int64

D19_KOSMETIK
Null percentage: 0.8368698672944197
6.0    89353
7.0    52858
5.0     1292
3.0     1263
4.0      269
2.0      249
1.0      101
Name: D19_KOSMETIK, dtype: int64

D19_DROGERIEARTIKEL
Null percentage: 0.8539004354699901
6.0    52060
3.0    24763
5.0    17548
4.0    10951
2.0     9928
7.0     7841
1.0     7116
Name: D19_DROGERIEARTIKEL, dtype: int64

KBA13_CCM_3001
Null percentage: 0.11871354018812394
1.0    338439
4.0    215760
3.0    147448
5.0     83682
2.0        92
Name: KBA13_CCM_3001, dtype: int64

EXTSEL992
Null percentage: 0.7339963937115486
56.0    19722
31.0    14987
27.0    13269
38.0    12856
23.0    12742
36.0    12059
35.0    11308
55.0     9812
34.0     8583
50.0     6435
53.0     5686
37.0     5211
21.0     5114
54.0     4857
6.0      4815
41.0     4517
19.0     4445
29.0     4332
18.0     4315
39.0     4253
33.0     4199
25.0     4095
20.0     4069
26.0     3087
32.0     3041
15.0     2917
48.0     2916
17.0     2868
14.0     2832
40.0     2787
3.0      2783
2.0      2701
43.0     2602
46.0     2550
22.0     2244
24.0     2238
47.0     1659
1.0      1526
4.0      1468
13.0     1458
30.0     1457
5.0      1437
52.0     1415
16.0     1147
45.0     1087
12.0     1027
9.0       983
42.0      912
10.0      866
11.0      709
51.0      674
8.0       642
7.0       546
44.0      447
49.0      251
28.0      110
Name: EXTSEL992, dtype: int64

D19_VERSAND_ANZ_12
Null percentage: 0.0
0    637972
1     96577
2     81616
3     34258
4     29393
5      9712
6      1693
Name: D19_VERSAND_ANZ_12, dtype: int64

SOHO_KZ
Null percentage: 0.08247000463409188
0.0    810834
1.0      6888
Name: SOHO_KZ, dtype: int64

From looking at the first feature, I can see that there is alot of values that are missing encoded as 0, and since these features weren't included in the Values sheet, their unknown values wasn't changed to null when we changed the rest of the features. So let's change them and take a look again.

In [68]:
# find features that don't have categories
no_cat_feats = azdias_new[list(set(azdias_feats).difference(dias_atts_feats))].copy()

# change 0 and -1 to null in non-binary features
for feat in no_cat_feats.columns:
    if no_cat_feats[feat].nunique() > 2:
        no_cat_feats[feat].replace(-1, np.nan, inplace=True)
        no_cat_feats[feat].replace(0, np.nan, inplace=True)
        
# transpose df so that features are index
no_cat_feats = no_cat_feats.T

# calculate null_percentage of features
no_cat_feats["null_percentage"] = no_cat_feats.isna().sum(axis=1)/no_cat_feats.shape[1]

# sort by null percentage
no_cat_feats.sort_values("null_percentage", inplace=True)

print(f"Number of features of unknown category: {len(no_cat_feats)}\n\n")
for feat, p in no_cat_feats["null_percentage"].iteritems():
    print(feat)
    print("Null percentage:", p)
    print(azdias_new[feat].value_counts())
    print()
Number of features of unknown category: 102


KOMBIALTER
Null percentage: 0.0
4    272770
3    246214
2    183764
1     94779
9     93694
Name: KOMBIALTER, dtype: int64

D19_KONSUMTYP_MAX
Null percentage: 0.0
8    260285
9    257113
1    144570
2     91423
4     75752
3     62078
Name: D19_KONSUMTYP_MAX, dtype: int64

LNR
Null percentage: 0.0
192418     1
611455     1
982154     1
976009     1
978056     1
955527     1
957574     1
951429     1
953476     1
963715     1
965762     1
959617     1
961664     1
613502     1
1021095    1
607357     1
609404     1
619643     1
621690     1
615545     1
617592     1
595063     1
597110     1
590965     1
593012     1
603251     1
980107     1
969868     1
967821     1
973966     1
          ..
626587     1
624538     1
581553     1
577459     1
735180     1
587700     1
724939     1
722890     1
729033     1
726984     1
749511     1
747462     1
753605     1
751556     1
741315     1
739266     1
745409     1
743360     1
569279     1
567230     1
573373     1
571324     1
561083     1
559034     1
565177     1
563128     1
585655     1
583606     1
589749     1
192606     1
Name: LNR, Length: 891221, dtype: int64

CJT_KATALOGNUTZER
Null percentage: 0.005446460529992
5.0    281804
4.0    174275
1.0    167426
3.0    156998
2.0    105864
Name: CJT_KATALOGNUTZER, dtype: int64

RT_KEIN_ANREIZ
Null percentage: 0.005446460529992
5.0    211534
4.0    206707
3.0    186655
1.0    141140
2.0    140331
Name: RT_KEIN_ANREIZ, dtype: int64

RT_SCHNAEPPCHEN
Null percentage: 0.005446460529992
5.0    402504
4.0    182059
3.0    133538
2.0    115106
1.0     53160
Name: RT_SCHNAEPPCHEN, dtype: int64

CJT_TYP_3
Null percentage: 0.005446460529992
5.0    270143
2.0    181803
3.0    170162
4.0    160469
1.0    103790
Name: CJT_TYP_3, dtype: int64

CJT_TYP_6
Null percentage: 0.005446460529992
5.0    272208
4.0    193430
2.0    175179
3.0    170731
1.0     74819
Name: CJT_TYP_6, dtype: int64

CJT_TYP_5
Null percentage: 0.005446460529992
5.0    271673
3.0    194636
2.0    174808
4.0    147220
1.0     98030
Name: CJT_TYP_5, dtype: int64

CJT_TYP_2
Null percentage: 0.005446460529992
5.0    233302
2.0    205425
3.0    173910
4.0    153045
1.0    120685
Name: CJT_TYP_2, dtype: int64

CJT_TYP_1
Null percentage: 0.005446460529992
5.0    267488
2.0    196545
3.0    171838
4.0    162940
1.0     87556
Name: CJT_TYP_1, dtype: int64

CJT_TYP_4
Null percentage: 0.005446460529992
5.0    254763
3.0    180975
2.0    180312
4.0    169791
1.0    100526
Name: CJT_TYP_4, dtype: int64

EINGEZOGENAM_HH_JAHR
Null percentage: 0.08247000463409188
1994.0    111439
1997.0     66259
2015.0     45926
2004.0     45103
2014.0     43992
2001.0     41440
2008.0     34959
2005.0     34714
2002.0     34547
2012.0     33580
2000.0     32267
1999.0     31991
2007.0     31105
2013.0     30982
1998.0     29590
2011.0     25973
2009.0     22886
1996.0     22039
2003.0     21716
2006.0     21378
2010.0     21240
1995.0     18195
2016.0     13699
1993.0       871
2018.0       565
1992.0       564
2017.0       208
1991.0       205
1990.0       158
1989.0        72
1988.0        28
1987.0        19
1986.0         8
1984.0         1
1971.0         1
1904.0         1
1900.0         1
Name: EINGEZOGENAM_HH_JAHR, dtype: int64

UNGLEICHENN_FLAG
Null percentage: 0.08247000463409188
0.0    744072
1.0     73650
Name: UNGLEICHENN_FLAG, dtype: int64

AKT_DAT_KL
Null percentage: 0.08247000463409188
1.0    390258
9.0    270663
5.0     29203
6.0     27655
3.0     24880
4.0     21466
7.0     21026
8.0     17485
2.0     15086
Name: AKT_DAT_KL, dtype: int64

SOHO_KZ
Null percentage: 0.08247000463409188
0.0    810834
1.0      6888
Name: SOHO_KZ, dtype: int64

VK_DHT4A
Null percentage: 0.08518313639377888
10.0    114500
7.0      96693
9.0      88238
8.0      86090
3.0      82570
6.0      80078
2.0      74200
5.0      70671
4.0      70198
1.0      49535
11.0      2531
Name: VK_DHT4A, dtype: int64

VK_ZG11
Null percentage: 0.08518313639377888
10.0    97938
5.0     97777
6.0     88581
7.0     88552
4.0     86600
8.0     83994
9.0     82134
3.0     69634
2.0     59916
1.0     52009
11.0     8169
Name: VK_ZG11, dtype: int64

VK_DISTANZ
Null percentage: 0.08518313639377888
10.0    94320
8.0     92104
9.0     89067
7.0     84222
6.0     83291
11.0    79990
3.0     71410
12.0    59994
1.0     44858
5.0     38791
4.0     29786
13.0    27554
2.0     19917
Name: VK_DISTANZ, dtype: int64

RT_UEBERGROESSE
Null percentage: 0.08525831415552372
5.0    188774
4.0    168467
3.0    155865
2.0    152524
1.0    149607
Name: RT_UEBERGROESSE, dtype: int64

EINGEFUEGT_AM
Null percentage: 0.10451728583594866
1992-02-10 00:00:00    383738
1992-02-12 00:00:00    192264
1995-02-07 00:00:00     11181
2005-12-16 00:00:00      6291
2003-11-18 00:00:00      6050
1993-03-01 00:00:00      3204
2005-04-15 00:00:00      2343
2000-05-10 00:00:00      2327
2004-04-14 00:00:00      2290
2005-08-23 00:00:00      2078
1992-02-21 00:00:00      2074
1994-02-03 00:00:00      2032
1995-10-17 00:00:00      2022
1995-10-10 00:00:00      1944
1993-09-21 00:00:00      1819
1993-09-22 00:00:00      1484
1993-04-01 00:00:00      1298
1993-04-02 00:00:00      1292
1994-12-13 00:00:00      1171
1996-03-07 00:00:00      1151
1995-10-18 00:00:00      1139
1995-07-19 00:00:00      1117
2006-01-19 00:00:00      1116
2005-04-12 00:00:00      1093
2003-11-17 00:00:00      1051
1993-10-21 00:00:00      1046
1996-05-09 00:00:00      1021
1995-08-02 00:00:00      1005
1993-09-23 00:00:00      1005
1995-08-15 00:00:00       988
                        ...  
2003-05-18 00:00:00         1
2015-01-14 00:00:00         1
2010-07-30 00:00:00         1
2014-10-27 00:00:00         1
2015-02-19 00:00:00         1
2005-08-15 00:00:00         1
2012-07-19 00:00:00         1
2014-07-25 00:00:00         1
2011-04-12 00:00:00         1
2015-06-08 00:00:00         1
2004-10-24 00:00:00         1
1993-05-10 00:00:00         1
2010-09-08 00:00:00         1
2010-04-27 00:00:00         1
2011-10-03 00:00:00         1
2005-05-02 00:00:00         1
2016-03-21 00:00:00         1
1995-01-27 00:00:00         1
2015-09-21 00:00:00         1
2008-12-28 00:00:00         1
2001-08-23 00:00:00         1
2011-02-07 00:00:00         1
2005-05-26 00:00:00         1
2013-07-24 00:00:00         1
2011-07-20 00:00:00         1
2013-09-26 00:00:00         1
2011-10-04 00:00:00         1
2014-09-11 00:00:00         1
2011-02-06 00:00:00         1
1998-12-03 00:00:00         1
Name: EINGEFUEGT_AM, Length: 5162, dtype: int64

MOBI_RASTER
Null percentage: 0.10451728583594866
1.0    355579
3.0    124055
2.0    118231
4.0     92804
5.0     81375
6.0     26029
Name: MOBI_RASTER, dtype: int64

DSL_FLAG
Null percentage: 0.10451728583594866
1.0    772388
0.0     25685
Name: DSL_FLAG, dtype: int64

FIRMENDICHTE
Null percentage: 0.10452514022896678
4.0    273637
3.0    181608
5.0    159217
2.0    139078
1.0     44526
Name: FIRMENDICHTE, dtype: int64

KONSUMZELLE
Null percentage: 0.10452514022896678
0.0    609591
1.0    188475
Name: KONSUMZELLE, dtype: int64

ANZ_STATISTISCHE_HAUSHALTE
Null percentage: 0.10456553425020282
1.0      219119
2.0      121485
3.0       61478
4.0       44864
5.0       40133
6.0       38466
7.0       36589
8.0       32816
9.0       28500
10.0      24100
11.0      19461
12.0      16296
13.0      13040
14.0      10616
15.0       8939
16.0       7314
17.0       6136
18.0       5220
19.0       4465
20.0       3840
21.0       3433
22.0       3221
23.0       2633
24.0       2485
25.0       2384
26.0       2042
27.0       1995
28.0       1930
29.0       1712
30.0       1661
          ...  
257.0         5
173.0         5
205.0         5
258.0         5
218.0         4
241.0         4
209.0         4
239.0         4
216.0         4
203.0         4
262.0         4
198.0         4
309.0         4
284.0         4
371.0         3
229.0         3
182.0         3
449.0         3
289.0         3
245.0         3
248.0         3
190.0         2
189.0         2
227.0         2
336.0         2
197.0         2
175.0         2
165.0         2
133.0         1
314.0         1
Name: ANZ_STATISTISCHE_HAUSHALTE, Length: 267, dtype: int64

STRUKTURTYP
Null percentage: 0.1091468894920564
3.0    555713
1.0    127607
2.0    110627
Name: STRUKTURTYP, dtype: int64

GEMEINDETYP
Null percentage: 0.1091468894920564
22.0    151307
11.0    150715
40.0    125571
30.0    122406
12.0    120145
21.0     72777
50.0     51026
Name: GEMEINDETYP, dtype: int64

UMFELD_ALT
Null percentage: 0.10972138223852446
4.0    228222
3.0    208733
5.0    135160
2.0    121133
1.0    100187
Name: UMFELD_ALT, dtype: int64

UMFELD_JUNG
Null percentage: 0.10972138223852446
5.0    350532
4.0    225939
3.0    130403
2.0     53460
1.0     33101
Name: UMFELD_JUNG, dtype: int64

CAMEO_INTL_2015
Null percentage: 0.11105999522004081
51      77576
51.0    56118
41      53459
24      52882
41.0    38877
24.0    38276
14      36524
43      32730
14.0    26360
54      26207
43.0    23942
25      22837
54.0    19184
22      19173
25.0    16791
23      15653
13      15272
45      15206
22.0    13982
55      13842
52      11836
23.0    11097
13.0    11064
31      11041
45.0    10926
34      10737
55.0    10113
15       9832
52.0     8706
44       8543
31.0     7983
34.0     7787
12       7645
15.0     7142
44.0     6277
35       6090
32       6067
33       5833
12.0     5604
32.0     4287
35.0     4266
33.0     4102
XX        373
Name: CAMEO_INTL_2015, dtype: int64

KBA13_CCM_1401_2500
Null percentage: 0.11871354018812394
3.0    359093
2.0    174525
4.0    157962
1.0     60428
5.0     33413
Name: KBA13_CCM_1401_2500, dtype: int64

KBA13_KMH_210
Null percentage: 0.11871354018812394
3.0    361259
4.0    164843
2.0    161113
5.0     55495
1.0     42711
Name: KBA13_KMH_210, dtype: int64

KBA13_GBZ
Null percentage: 0.11871354018812394
3.0    284563
4.0    184003
5.0    167473
2.0    109422
1.0     39960
Name: KBA13_GBZ, dtype: int64

KBA13_CCM_3001
Null percentage: 0.11871354018812394
1.0    338439
4.0    215760
3.0    147448
5.0     83682
2.0        92
Name: KBA13_CCM_3001, dtype: int64

KBA13_BAUMAX
Null percentage: 0.11871354018812394
1.0    491118
5.0    115476
2.0     69249
3.0     59060
4.0     50518
Name: KBA13_BAUMAX, dtype: int64

KBA13_HHZ
Null percentage: 0.11871354018812394
3.0    319964
4.0    212119
5.0    168143
2.0     72085
1.0     13110
Name: KBA13_HHZ, dtype: int64

HH_DELTA_FLAG
Null percentage: 0.12073548536221655
0.0    710942
1.0     72677
Name: HH_DELTA_FLAG, dtype: int64

KBA13_ANTG1
Null percentage: 0.12769896580085074
2.0    299448
3.0    219174
1.0    200723
4.0     58068
Name: KBA13_ANTG1, dtype: int64

KBA13_ANTG2
Null percentage: 0.13080032898686184
3.0    325207
2.0    207993
4.0    182916
1.0     58533
Name: KBA13_ANTG2, dtype: int64

VHN
Null percentage: 0.17735668257368262
2.0    233844
3.0    179579
4.0    178413
1.0    141321
Name: VHN, dtype: int64

KBA13_CCM_3000
Null percentage: 0.182179279886807
3.0    308099
1.0    149271
2.0    103136
4.0     92197
5.0     76156
0.0     56562
Name: KBA13_CCM_3000, dtype: int64

KBA13_ANTG3
Null percentage: 0.26928225434544295
2.0    251723
1.0    220761
3.0    178747
Name: KBA13_ANTG3, dtype: int64

D19_LETZTER_KAUF_BRANCHE
Null percentage: 0.2884952217239046
D19_UNBEKANNT             195338
D19_VERSICHERUNGEN         57734
D19_SONSTIGE               44722
D19_VOLLSORTIMENT          34812
D19_SCHUHE                 32578
D19_BUCH_CD                28754
D19_VERSAND_REST           26034
D19_DROGERIEARTIKEL        24072
D19_BANKEN_DIREKT          23273
D19_BEKLEIDUNG_REST        21796
D19_HAUS_DEKO              20858
D19_TELKO_MOBILE           14447
D19_ENERGIE                12084
D19_TELKO_REST             11472
D19_BANKEN_GROSS           10550
D19_BEKLEIDUNG_GEH         10272
D19_KINDERARTIKEL           7301
D19_FREIZEIT                7257
D19_TECHNIK                 7002
D19_LEBENSMITTEL            6458
D19_BANKEN_REST             5247
D19_RATGEBER                4931
D19_NAHRUNGSERGAENZUNG      4061
D19_DIGIT_SERV              3577
D19_REISEN                  3122
D19_TIERARTIKEL             2578
D19_SAMMELARTIKEL           2443
D19_HANDWERK                2227
D19_WEIN_FEINKOST           2164
D19_GARTEN                  1646
D19_BANKEN_LOKAL            1442
D19_BIO_OEKO                1232
D19_BILDUNG                  980
D19_LOTTO                    839
D19_KOSMETIK                 805
Name: D19_LETZTER_KAUF_BRANCHE, dtype: int64

ALTERSKATEGORIE_FEIN
Null percentage: 0.3412565457950385
15.0    63486
14.0    59709
16.0    53384
18.0    51365
17.0    50011
13.0    49556
12.0    42951
19.0    42340
10.0    34903
11.0    33061
20.0    27833
9.0     26204
8.0     14516
21.0    13658
7.0      8578
6.0      3754
22.0     3669
23.0     2838
24.0     2340
25.0     1017
5.0       994
4.0       636
3.0       218
2.0        64
1.0         1
Name: ALTERSKATEGORIE_FEIN, dtype: int64

VERDICHTUNGSRAUM
Null percentage: 0.5229409989217041
1.0     111235
2.0      47613
3.0      29827
4.0      26996
5.0      24019
6.0      21882
7.0      13238
8.0      11864
10.0     11034
9.0       9425
13.0      8707
11.0      8226
14.0      8180
12.0      8046
15.0      6942
16.0      6435
17.0      5502
18.0      5061
20.0      3538
22.0      3492
21.0      3364
19.0      3300
23.0      3239
24.0      2980
25.0      2887
30.0      2648
27.0      2622
26.0      2569
29.0      2552
28.0      2454
32.0      2390
31.0      2313
33.0      2240
34.0      2054
36.0      1959
35.0      1769
39.0      1660
38.0      1622
44.0      1435
40.0      1359
37.0      1348
41.0      1329
42.0      1324
43.0      1321
45.0      1165
Name: VERDICHTUNGSRAUM, dtype: int64

KBA13_ANTG4
Null percentage: 0.5459869100930073
1.0    277982
2.0    126644
Name: KBA13_ANTG4, dtype: int64

D19_GESAMT_ANZ_24
Null percentage: 0.5669783364619999
0    505303
2    101785
1     86493
4     74210
3     58554
5     46547
6     18329
Name: D19_GESAMT_ANZ_24, dtype: int64

D19_SONSTIGE
Null percentage: 0.5677076729565393
6.0    220478
7.0     58373
3.0     44578
5.0     35028
2.0     10816
4.0     10204
1.0      5791
Name: D19_SONSTIGE, dtype: int64

D19_VERSAND_ANZ_24
Null percentage: 0.6326354518127378
0    563818
2     93666
1     90253
4     55016
3     47832
5     30398
6     10238
Name: D19_VERSAND_ANZ_24, dtype: int64

KK_KUNDENTYP
Null percentage: 0.6559674873011295
3.0    65151
2.0    62564
5.0    48038
4.0    44512
6.0    44114
1.0    42230
Name: KK_KUNDENTYP, dtype: int64

D19_GESAMT_ANZ_12
Null percentage: 0.656175067688037
0    584797
1     99465
2     97282
3     45685
4     43579
5     16966
6      3447
Name: D19_GESAMT_ANZ_12, dtype: int64

D19_VOLLSORTIMENT
Null percentage: 0.6732359313795344
6.0    173626
3.0     44589
5.0     31303
7.0     23231
2.0      8250
4.0      6366
1.0      3854
Name: D19_VOLLSORTIMENT, dtype: int64

D19_BUCH_CD
Null percentage: 0.6988031027096534
6.0    188263
3.0     26192
5.0     15900
1.0     11735
2.0      9187
7.0      8853
4.0      8303
Name: D19_BUCH_CD, dtype: int64

D19_TECHNIK
Null percentage: 0.7070086993012956
6.0    190979
7.0     41174
5.0     14389
3.0     11064
4.0      1846
2.0      1163
1.0       505
Name: D19_TECHNIK, dtype: int64

D19_VERSAND_ANZ_12
Null percentage: 0.7158404032220964
0    637972
1     96577
2     81616
3     34258
4     29393
5      9712
6      1693
Name: D19_VERSAND_ANZ_12, dtype: int64

EXTSEL992
Null percentage: 0.7339963937115486
56.0    19722
31.0    14987
27.0    13269
38.0    12856
23.0    12742
36.0    12059
35.0    11308
55.0     9812
34.0     8583
50.0     6435
53.0     5686
37.0     5211
21.0     5114
54.0     4857
6.0      4815
41.0     4517
19.0     4445
29.0     4332
18.0     4315
39.0     4253
33.0     4199
25.0     4095
20.0     4069
26.0     3087
32.0     3041
15.0     2917
48.0     2916
17.0     2868
14.0     2832
40.0     2787
3.0      2783
2.0      2701
43.0     2602
46.0     2550
22.0     2244
24.0     2238
47.0     1659
1.0      1526
4.0      1468
13.0     1458
30.0     1457
5.0      1437
52.0     1415
16.0     1147
45.0     1087
12.0     1027
9.0       983
42.0      912
10.0      866
11.0      709
51.0      674
8.0       642
7.0       546
44.0      447
49.0      251
28.0      110
Name: EXTSEL992, dtype: int64

D19_VERSICHERUNGEN
Null percentage: 0.7345697644018712
6.0    116559
3.0     44933
5.0     30998
2.0     14892
4.0     13254
1.0     10107
7.0      5814
Name: D19_VERSICHERUNGEN, dtype: int64

D19_BEKLEIDUNG_REST
Null percentage: 0.7770261248332344
6.0    109025
3.0     30201
7.0     19637
5.0     18554
2.0      8748
1.0      7262
4.0      5292
Name: D19_BEKLEIDUNG_REST, dtype: int64

D19_HAUS_DEKO
Null percentage: 0.8001382373171189
6.0    100720
3.0     31024
5.0     20748
7.0     10278
2.0      7476
1.0      4133
4.0      3742
Name: D19_HAUS_DEKO, dtype: int64

D19_TELKO_MOBILE
Null percentage: 0.8155148947343027
6.0    116433
3.0     16429
5.0     14257
7.0      8222
4.0      3573
2.0      3558
1.0      1945
Name: D19_TELKO_MOBILE, dtype: int64

D19_BANKEN_DIREKT
Null percentage: 0.8177668614182116
6.0    84798
3.0    27350
5.0    18539
4.0     8771
2.0     8119
7.0     7656
1.0     7177
Name: D19_BANKEN_DIREKT, dtype: int64

D19_VERSAND_REST
Null percentage: 0.8240851595732147
6.0    69248
3.0    39251
5.0    25227
2.0     7560
7.0     5770
4.0     4901
1.0     4822
Name: D19_VERSAND_REST, dtype: int64

D19_REISEN
Null percentage: 0.8268701029262102
6.0    94123
7.0    45315
3.0     5758
5.0     5149
2.0     2975
4.0      890
1.0       87
Name: D19_REISEN, dtype: int64

VHA
Null percentage: 0.8292511060668454
1.0    81016
4.0    24469
5.0    22372
3.0    19445
2.0     4873
Name: VHA, dtype: int64

D19_KOSMETIK
Null percentage: 0.8368698672944197
6.0    89353
7.0    52858
5.0     1292
3.0     1263
4.0      269
2.0      249
1.0      101
Name: D19_KOSMETIK, dtype: int64

D19_LOTTO
Null percentage: 0.839248626322764
7.0    113486
6.0     25736
5.0      2011
3.0      1869
4.0        78
2.0        66
1.0        19
Name: D19_LOTTO, dtype: int64

D19_KINDERARTIKEL
Null percentage: 0.8408296034316965
6.0    79320
7.0    22597
3.0    16402
5.0    13398
2.0     5842
4.0     3224
1.0     1073
Name: D19_KINDERARTIKEL, dtype: int64

D19_DROGERIEARTIKEL
Null percentage: 0.8539004354699901
6.0    52060
3.0    24763
5.0    17548
4.0    10951
2.0     9928
7.0     7841
1.0     7116
Name: D19_DROGERIEARTIKEL, dtype: int64

D19_SOZIALES
Null percentage: 0.8560626376622633
4.0    36514
5.0    30414
1.0    25128
3.0    21483
2.0    14741
Name: D19_SOZIALES, dtype: int64

D19_TELKO_REST
Null percentage: 0.8594647118952539
6.0    88346
5.0    15301
3.0    11467
7.0     5905
4.0     2402
2.0     1397
1.0      430
Name: D19_TELKO_REST, dtype: int64

D19_HANDWERK
Null percentage: 0.8621666230934864
6.0    90968
7.0    25319
5.0     3810
3.0     2465
4.0      184
2.0       87
1.0        7
Name: D19_HANDWERK, dtype: int64

D19_SCHUHE
Null percentage: 0.8673763297767894
3.0    46487
6.0    20589
5.0    19947
2.0    17071
7.0     5632
1.0     4976
4.0     3495
Name: D19_SCHUHE, dtype: int64

D19_VERSI_ANZ_24
Null percentage: 0.8718791410884618
0    777037
1     63340
2     37144
3      8848
4      4048
5       707
6        97
Name: D19_VERSI_ANZ_24, dtype: int64

D19_BANKEN_GROSS
Null percentage: 0.8812079158816949
6.0    57103
3.0    14862
5.0    13911
4.0     8165
1.0     6347
2.0     5482
Name: D19_BANKEN_GROSS, dtype: int64

D19_FREIZEIT
Null percentage: 0.8872636528986637
6.0    55056
3.0    15297
5.0    13587
7.0     8514
4.0     3990
2.0     2676
1.0     1353
Name: D19_FREIZEIT, dtype: int64

D19_BANKEN_ANZ_24
Null percentage: 0.8910247850981967
0    794100
1     43554
2     29079
3     10214
4      9041
5      3930
6      1303
Name: D19_BANKEN_ANZ_24, dtype: int64

D19_SAMMELARTIKEL
Null percentage: 0.8999844034195783
6.0    71297
7.0    10067
5.0     4072
3.0     3016
4.0      470
2.0      172
1.0       42
Name: D19_SAMMELARTIKEL, dtype: int64

ANZ_KINDER
Null percentage: 0.9029645845418813
1.0     55350
2.0     24445
3.0      5376
4.0      1057
5.0       190
6.0        47
7.0        10
9.0         3
11.0        1
8.0         1
Name: ANZ_KINDER, dtype: int64

D19_RATGEBER
Null percentage: 0.9033348630698783
6.0    44707
7.0    12287
3.0    10598
2.0     8334
5.0     7197
4.0     2136
1.0      891
Name: D19_RATGEBER, dtype: int64

D19_BEKLEIDUNG_GEH
Null percentage: 0.9080845267335487
6.0    39392
3.0    15239
5.0    11899
7.0     8478
2.0     3117
4.0     2013
1.0     1779
Name: D19_BEKLEIDUNG_GEH, dtype: int64

ALTER_KIND1
Null percentage: 0.9090483729624863
18.0    6703
17.0    6394
8.0     6343
7.0     6249
16.0    6124
15.0    6008
14.0    5992
9.0     5846
13.0    5713
10.0    5678
12.0    5576
11.0    5506
6.0     4875
5.0     1501
4.0     1084
3.0     1063
2.0      403
Name: ALTER_KIND1, dtype: int64

D19_BILDUNG
Null percentage: 0.9124066870058044
6.0    37502
7.0    21828
2.0     8582
3.0     5127
5.0     3363
4.0     1288
1.0      375
Name: D19_BILDUNG, dtype: int64

D19_VERSI_ANZ_12
Null percentage: 0.9215323696367119
0    821289
1     44933
2     20273
3      3335
4      1210
5       170
6        11
Name: D19_VERSI_ANZ_12, dtype: int64

D19_BANKEN_REST
Null percentage: 0.9220608580812166
6.0    43143
5.0     7744
7.0     7339
3.0     5943
2.0     2928
4.0     1448
1.0      916
Name: D19_BANKEN_REST, dtype: int64

D19_TELKO_ANZ_24
Null percentage: 0.9270517638161578
0    826208
1     46520
2     15343
3      2055
4       844
5       197
6        54
Name: D19_TELKO_ANZ_24, dtype: int64

D19_ENERGIE
Null percentage: 0.9311461466908881
6.0    25788
3.0    14572
5.0     9556
7.0     7655
2.0     1967
4.0     1185
1.0      641
Name: D19_ENERGIE, dtype: int64

D19_BANKEN_ANZ_12
Null percentage: 0.9332522460758892
0    831734
1     29771
2     18067
3      5708
4      4082
5      1483
6       376
Name: D19_BANKEN_ANZ_12, dtype: int64

D19_WEIN_FEINKOST
Null percentage: 0.9381982695650125
6.0    27556
7.0    20665
3.0     3460
5.0     2952
4.0      231
2.0      179
1.0       36
Name: D19_WEIN_FEINKOST, dtype: int64

D19_LEBENSMITTEL
Null percentage: 0.9401865530547417
6.0    27626
3.0    10044
7.0     8318
5.0     5198
2.0     1315
4.0      409
1.0      397
Name: D19_LEBENSMITTEL, dtype: int64

D19_GARTEN
Null percentage: 0.9555721869210891
6.0    20410
7.0     9555
5.0     4979
3.0     4003
4.0      328
2.0      265
1.0       55
Name: D19_GARTEN, dtype: int64

D19_NAHRUNGSERGAENZUNG
Null percentage: 0.9561893178010842
6.0    15778
7.0    11596
3.0     5768
5.0     4243
2.0      679
1.0      572
4.0      409
Name: D19_NAHRUNGSERGAENZUNG, dtype: int64

D19_TIERARTIKEL
Null percentage: 0.9562386882714837
6.0    20168
7.0     7945
3.0     6030
5.0     3809
2.0      553
4.0      455
1.0       41
Name: D19_TIERARTIKEL, dtype: int64

D19_BIO_OEKO
Null percentage: 0.9583189803651395
6.0    17732
7.0    15241
5.0     2121
3.0     1926
4.0       83
2.0       42
1.0        2
Name: D19_BIO_OEKO, dtype: int64

D19_DIGIT_SERV
Null percentage: 0.9623437957588522
6.0    17942
3.0     6153
7.0     4030
5.0     3225
2.0     1293
4.0      465
1.0      452
Name: D19_DIGIT_SERV, dtype: int64

D19_TELKO_ANZ_12
Null percentage: 0.9627129522307037
0    857990
1     24868
2      6954
3       865
4       406
5       103
6        35
Name: D19_TELKO_ANZ_12, dtype: int64

ALTER_KIND2
Null percentage: 0.9669004657655059
18.0    3128
14.0    3111
17.0    3085
15.0    3083
16.0    3010
13.0    2968
12.0    2628
11.0    2450
10.0    1953
9.0     1641
8.0     1179
7.0      627
6.0      396
5.0      154
4.0       67
3.0       15
2.0        4
Name: ALTER_KIND2, dtype: int64

D19_BANKEN_LOKAL
Null percentage: 0.9815130029476415
7.0    8522
3.0    3500
6.0    3202
5.0    1053
2.0     118
4.0      69
1.0      12
Name: D19_BANKEN_LOKAL, dtype: int64

ALTER_KIND3
Null percentage: 0.9930769135826019
18.0    866
15.0    847
16.0    841
17.0    826
14.0    746
13.0    674
12.0    438
11.0    363
10.0    237
9.0     159
8.0     102
7.0      40
6.0      21
5.0       8
4.0       2
Name: ALTER_KIND3, dtype: int64

D19_VERSI_ONLINE_QUOTE_12
Null percentage: 0.9981530955845969
10.0    1548
5.0       70
7.0       11
3.0        9
8.0        6
6.0        1
9.0        1
Name: D19_VERSI_ONLINE_QUOTE_12, dtype: int64

ALTER_KIND4
Null percentage: 0.9986479223447383
17.0    225
18.0    216
15.0    171
16.0    159
14.0    136
13.0    119
12.0     59
11.0     48
10.0     42
9.0      15
8.0      14
7.0       1
Name: ALTER_KIND4, dtype: int64

D19_TELKO_ONLINE_QUOTE_12
Null percentage: 0.9991158197573891
10.0    767
5.0      19
7.0       1
3.0       1
Name: D19_TELKO_ONLINE_QUOTE_12, dtype: int64

Note that this time we have included 0 and -1 in the Null percentage, but we can still see them in the value counts. And that is to give is the ability to see if 0 is or isn't really a missing value in some features.

Right now I'll try to find the translation of features if they don't have a huge percentage of missing values.

Notes:

  1. KOMBIALTER translates to combial, which I have no idea what it means, but it has no missing values. Howver, it's values are 1-2-3-4 and then 9, which could mean that 9 is the missing value.
  2. D19_KONSUMTYP_MAX (D19 CONSUMPTION TYPE MAX) and it also has no missing values.
  3. LNR has no missing values, and it looks like an indiviual ID so it should be dropped.
  4. CJTTYP(1-5), CJT_KATALOGNUTZER, RT_SCHNAEPPCHEN (RT Bargain) and RT_KEIN_ANREIZ (RT No Incentive) have 0.5% missing values which is trivial and I think it is going ot be removed with dropped rows.
  5. UNGLEICHENN_FLAG (Inequality flag) has 8% missing values, and I'm not sure if this feature has missing values encoded as 0 or not, but either way all of it's missing values will be dropped witht the rows.
  6. EINGEZOGENAM_HH_JAHR (RECOVERED HH YEAR) has 8% missing values which will all be dropped with the rows we are dropping, and it encodes information related to year.
  7. SOHO_KZ has 8% missing values, in addition to a more than 90% 0, and I don't understand what it means.
  8. AKT_DAT_KL, VK_DHT4A, VK_ZG11 and VK_DISTANZ, have 8% missing values, and I couldn't find anything related to it's translation.
  9. RT_UEBERGROESSE (RT OVER-SIZE) has 8% missing values.
  10. EINGEFUEGT_AM (INSERTED_AM) is the timestamp of insertion, but which data exactly? It has 10.5% missing values.
  11. DSL_FLAG and MOBI_RASTER have 10.5% missing values with no translation available.
  12. KONSUMZELLE (Consumer Cell) has 10.5% missing values and only 0-1 values, so I don't understand what does this feature mean exactly.
  13. FIRMENDICHTE (COMPANY DENSITY) has 10.5% missing values, and I think that all of the featuers so far that have 10.5% missing values come from the same source, which I don't know exactly what it is so far.
  14. ANZ_STATISTISCHE_HAUSHALTE (ANZ STATISTICAL BUDGETS) has 10.5% missing values. It's distribution seems to be right skewed.
  15. STRUKTURTYP (STRUCTURE TYPE) has 10.9% missing values.
  16. GEMEINDETYP (COMMUNITY TYPE) has 10.9% missing values, and it has weird values that don't make sense (not ordinal or nominal). I could further investigate if it has any relation with other community level features.
  17. UMFELD_JUNG (ENVIRONMENT YOUNG) and UMFELD_ALT (ENVIRONMENT OLD) has 10.9% missing values.
  18. CAMEO_INTL_2015 CAMEO is a consumer segmentation system linking address information to demographic, lifestyle and socio-economic insight. We could use KNN imputation to predict the missing values in CAMEO features since they are related to demographic data.
  19. All KB18 features are PLZ8 features.
  20. D19_LETZTER_KAUF_BRANCHE (D19 LAST PURCHASE SECTOR) has 28% missing values, and I think that it has valuable information, so it shouldn't be dropped with other D19 features.
  21. ALTERSKATEGORIE FEIN (AGE CATEGORY FINE) has 34% missing values (which are 0), and we know that they are missing because category 1 only has one data point, and 0 has 41188.
  22. The rest of the features have more than 50% missing values, so I'm going to drop them except ANZ_KINDER (probably the number of children), ALTER_KINDX (age of Xth child), as I understand their meaning and know that there missing values are not missing at random.

Which features are missing with EINGEFUEGT_AM?

In [69]:
%%time
# calculate all features null correlation
azdias_null_corr = azdias_new.replace(0, np.nan).isna().astype(int).corr()
CPU times: user 4min 36s, sys: 40.3 s, total: 5min 16s
Wall time: 12min 51s
In [70]:
# top 50 features correlated with null values in EINGEFUEGT_AM
azdias_null_corr["EINGEFUEGT_AM"].sort_values(ascending=False)
Out[70]:
MOBI_RASTER                   1.000000
GEBAEUDETYP                   1.000000
MIN_GEBAEUDEJAHR              1.000000
OST_WEST_KZ                   1.000000
EINGEFUEGT_AM                 1.000000
KBA05_MODTEMP                 1.000000
FIRMENDICHTE                  0.999958
GEBAEUDETYP_RASTER            0.999958
ANZ_STATISTISCHE_HAUSHALTE    0.999742
BALLRAUM                      0.996468
EWDICHTE                      0.996468
INNENSTADT                    0.996468
ARBEIT                        0.976356
RELAT_AB                      0.976356
ORTSGR_KLS9                   0.976029
GEMEINDETYP                   0.976029
STRUKTURTYP                   0.976029
UMFELD_ALT                    0.973157
UMFELD_JUNG                   0.973157
CAMEO_INTL_2015               0.966547
CAMEO_DEU_2015                0.966547
CAMEO_DEUG_2015               0.964504
ANZ_HAUSHALTE_AKTIV           0.963091
WOHNLAGE                      0.960450
KBA13_HALTER_65               0.930838
KBA13_HERST_SONST             0.930838
KBA13_HERST_ASIEN             0.930838
KBA13_HERST_AUDI_VW           0.930838
KBA13_HERST_BMW_BENZ          0.930838
KBA13_HERST_EUROPA            0.930838
                                ...   
D19_VERSAND_OFFLINE_DATUM          NaN
D19_VERSAND_ONLINE_DATUM           NaN
D19_VERSI_DATUM                    NaN
D19_VERSI_OFFLINE_DATUM            NaN
D19_VERSI_ONLINE_DATUM             NaN
FINANZ_ANLEGER                     NaN
FINANZ_HAUSBAUER                   NaN
FINANZ_MINIMALIST                  NaN
FINANZ_SPARER                      NaN
FINANZ_UNAUFFAELLIGER              NaN
FINANZ_VORSORGER                   NaN
FINANZTYP                          NaN
KOMBIALTER                         NaN
SEMIO_DOM                          NaN
SEMIO_ERL                          NaN
SEMIO_FAM                          NaN
SEMIO_KAEM                         NaN
SEMIO_KRIT                         NaN
SEMIO_KULT                         NaN
SEMIO_LUST                         NaN
SEMIO_MAT                          NaN
SEMIO_PFLICHT                      NaN
SEMIO_RAT                          NaN
SEMIO_REL                          NaN
SEMIO_SOZ                          NaN
SEMIO_TRADV                        NaN
SEMIO_VERT                         NaN
ZABEOTYP                           NaN
ANREDE_KZ                          NaN
ALTERSKATEGORIE_GROB               NaN
Name: EINGEFUEGT_AM, Length: 366, dtype: float64
In [71]:
# make mappings from feature to category
from collections import defaultdict

cats =  ["PLZ8",
             "Microcell (RR3_ID)",
             "Person",
             "Household",
             "Microcell (RR4_ID)",
             "Building",
             "RR1_ID",
             "Postcode",
             "Community"]


for cat in cats:
    mappings = {feat: cat for cat in cats for feat in dias_atts[dias_atts["Information level"] == cat].Attribute}
    
unknown_mappings = {feat: "Unknown" for feat in no_cat_feats.index}
In [72]:
# plot distributions of feature categories null values correlation with EINGEFUEGT_AM
eing_null_corr = azdias_null_corr["EINGEFUEGT_AM"].reset_index().replace({"index": mappings})
eing_null_corr = eing_null_corr.replace({"index": unknown_mappings})
plt.figure(figsize=(8, 6))
ax = plt.subplot()
for cat, cat_null in eing_null_corr.groupby('index'):
    cat_null.hist(label=cat, alpha=0.5, ax=ax)
plt.legend()
plt.tight_layout();

We can see that EINGEFUEGT_AM shares it's null values with features in difference categories, but the most prominent are Building, PLZ8, Household and some Unknown features. There is no clear pattern other than those, as it is probable that other features share null values in the same row, so there is no insight that we act upon.

Now it's time to prepare the dataset using the information we gathered so far

Data Cleaning Steps

  1. Clean columns with mixed types
  2. Drop columns with more than 50% missing values
  3. Replace encoded unknown values with null from Values sheet and
    1. Replace ALTER_HH 0 to null
    2. Replace KOMBIALTER 9 to null
  4. Features to drop:
    1. GEBURTSJAHR (year of birth) has 44% missing values
    2. AGER_TYP (best-ager typology) has 76% missing values
    3. CAMEO_DEU_2015 as it is categorical and needs one-hot-encoding while CAMEO_DEUG_2015 is ordinal and can be better used with PCA.
    4. LNR as it is an individual identifier
    5. All features with more than 50% missing values
  5. Feature engineering:
    1. D19_LETZTER_KAUF_BRANCHE need one hot encoding
    2. MIN_GEBAEUDEJAHR should be changed to number of years between 2017 and date
    3. EINGEFUEGT_AM should be changed to time between 2017 and timestamp
    4. Change ALTER_KINDX and ANZ_KINDER null values to 0
    5. Convert OST_WEST_KZ to binary labels
    6. Convert CAMEO_INTL_2015 and CAMEO_DEUG_2015 to int
  6. Impute missing values

Uncertainties

  1. KBA13_ANZAHL_PKW is supposed to encode the number of cars in the PLZ8, but it has high values of peculiar number which are 1400, 1500, 1300, etc. but I'll keep it.
  2. KONSUMZELLE (Consumer Cell) has 10.5% missing values and only 0-1 values, where 99% of these missing values will be dropped with rows we are dropping first. I don't understand what does this feature mean exactly. but I'll keep it.
  3. GEMEINDETYP (COMMUNITY TYPE) has 10.9% missing values, and it has weird values that don't make sense (not ordinal or nominal), so I'll drop it.
  4. RT_UEBERGROESSE (RT OVER-SIZE) has small percentage of values encoded as 0, which I don't know if they are missing or not. 89% of it's missing values will be dropped with rows, so I'll keep it.
In [73]:
def replace_unknown_with_null(df):
    """
    Replace unknown values encoded as 0 and -1 into null values.
    """
    # Read Values sheets which has information about unknown encodings
    dias_vals = pd.read_excel("DIAS Attributes - Values 2017.xlsx")
    
    # Find unknown values for features
    feat_unknown_vals = dias_vals.query("Meaning == 'unknown'")
    
    # Replace unknown values for each feature in the dataframe
    for feat in feat_unknown_vals.itertuples():
        # Check if feature in dataframe featuers
        if feat.Attribute in df.columns:
            # if unknown values are more than one
            if ',' in str(feat.Value):
                # loop over unknown values
                for val in str(feat.Value).split(','):
                    # replace unknown value with null
                    df[feat.Attribute].replace(eval(val), np.nan, inplace=True)
            else:
                # replace unknown value with null
                df[feat.Attribute].replace(feat.Value, np.nan, inplace=True) 
    
    # Replace other unknown values that aren't in Values sheet
    df["ALTER_HH"].replace(0, np.nan, inplace=True)
    df["KOMBIALTER"].replace(0, np.nan, inplace=True)
    
    # Replace Non-binary features with 0 and -1 that are of unknown category to null
    unknown_feats = df[list(set(df.columns).difference(dias_vals.Attribute))].copy()
    for feat in unknown_feats:
        if df[feat].nunique() > 2:
            df[feat].replace(-1, np.nan, inplace=True)
            df[feat].replace(0, np.nan, inplace=True)
    
    return df



def clean_dataset(df, p_row=0.5, p_col=0.5, drop_uncertain=True, keep_features=[]):
    """
    Clean dataset using insights gained during EDA.
    
    inputs
    
    1. df (pandas dataframe)
    2. p_thresh (float)  -  maximum threshold of null values in columns
    3. drop_uncertain (bool)  -  drop features we are uncertain from
    """
    # Make a new copy of the dataframe
    clean_df = df.copy()
    
    # Clean columns with mixed dtypes
    clean_df["CAMEO_DEUG_2015"].replace('X', np.nan, inplace=True)
    clean_df["CAMEO_INTL_2015"].replace('XX', np.nan, inplace=True)
    
    # Replace unknown values with missing
    clean_df = replace_unknown_with_null(clean_df)   

    # Drop rows with more than 50% missing values
    min_count = int(((1 - p_row))*clean_df.shape[1] + 1)
    clean_df.dropna(axis=0, thresh=min_count, inplace=True)
    
    # Drop duplicated rows
    clean_df.drop_duplicates(inplace=True)
       
    # Drop GEBURTSJAHR (year of birth) that has 44% missing values 
    clean_df.drop('GEBURTSJAHR', axis=1, inplace=True)
      
    # Drop LNR which is a unique indentifier
    clean_df.drop('LNR', axis=1, inplace=True)
    
    # Drop CAMEO_DEU_2015 as it's not suitable for PCA
    clean_df.drop('CAMEO_DEU_2015', axis=1, inplace=True)
    
    # Drop features with more than p_thresh missing values
    features_missing_p = clean_df.isna().sum() / clean_df.shape[0]
    features_above_p_thresh = clean_df.columns[features_missing_p > p_col]
    
    features_to_keep = ["ALTER_KIND1", "ALTER_KIND2", "ALTER_KIND3", "ALTER_KIND4", "ANZ_KINDER"] + keep_features
    features_to_remove = [feat for feat in features_above_p_thresh if feat not in features_to_keep]
    
    clean_df.drop(features_to_remove, axis=1, inplace=True)
    
    # Drop uncertain features
    if drop_uncertain:
        uncertain_features = ["GEMEINDETYP"]
        clean_df.drop(uncertain_features, axis=1, inplace=True)
        
    # Feature Engineering
    # One Hot Encoding D19_LETZTER_KAUF_BRANCHE
    dummies = pd.get_dummies(clean_df["D19_LETZTER_KAUF_BRANCHE"], prefix="D19_LETZTER_KAUF_BRANCHE")
    clean_df = pd.concat([clean_df, dummies], axis=1)
    clean_df.drop("D19_LETZTER_KAUF_BRANCHE", axis=1, inplace=True)
    
    # Calculate year difference in MIN_GEBAEUDEJAHR
    clean_df["MIN_GEBAEUDEJAHR_ENG"] = (2017 - clean_df["MIN_GEBAEUDEJAHR"])
    clean_df.drop("MIN_GEBAEUDEJAHR", axis=1, inplace=True)
    
    # Calculate days difference in EINGEFUEGT_AM
    current = datetime.strptime("2017-01-01", "%Y-%m-%d")
    clean_df["EINGEFUEGT_AM_DAY"] = (current - pd.to_datetime(clean_df["EINGEFUEGT_AM"])).dt.days
    clean_df.drop("EINGEFUEGT_AM", axis=1, inplace=True)
    
    # Replace null values in ALTER_KIND and ANZ_KINDER with 0 to avoid imputation
    for feat in clean_df.columns[clean_df.columns.str.startswith("ALTER_KIND")]:
        clean_df[feat].replace(np.nan, 0, inplace=True)
    clean_df["ANZ_KINDER"].replace(np.nan, 0, inplace=True)
    
    # Convert OST_WEST_KZ to binary labels
    clean_df["OST_WEST_KZ"] = (clean_df["OST_WEST_KZ"] == "W").astype(np.uint8)
    
    # Convert CAMEO_INTL_2015 and CAMEO_DEUG_2015 to float32
    CAMEO_feats = ["CAMEO_INTL_2015", "CAMEO_DEUG_2015"]
    clean_df[CAMEO_feats] = clean_df[CAMEO_feats].astype(np.float32)

    # Convert float16 features to float32 to enable arithmetic operations
    float_feats = clean_df.select_dtypes(np.float16).columns
    clean_df[float_feats] = clean_df[float_feats].astype(np.float32)
    
    return clean_df    
In [22]:
clean_azdias = clean_dataset(azdias)
In [81]:
# pickle clean AZDIAS 
pd.to_pickle(clean_azdias, "clean_azdias.pkl")
In [74]:
# Delete AZDIAS to free up memory
del azdias

gc.collect()
Out[74]:
56
In [75]:
clean_azdias = pd.read_pickle("clean_azdias.pkl")

Inspecting features after cleaning dataset

In [76]:
features_missing = (clean_azdias.isna().sum() / clean_azdias.shape[0]).sort_values(ascending=False)
In [77]:
for feat, p in features_missing.iteritems():
    print(feat)
    print(p)
    print(clean_azdias[feat].value_counts())
    print()
KBA13_ANTG4
0.48839738070221356
1.0    277982
2.0    126644
Name: KBA13_ANTG4, dtype: int64

KBA05_BAUMAX
0.4762415934272265
1.0    208383
5.0     98624
3.0     59936
4.0     37617
2.0      9680
Name: KBA05_BAUMAX, dtype: int64

VERDICHTUNGSRAUM
0.46667652886146016
1.0     110530
2.0      47117
3.0      29569
4.0      26660
5.0      23830
6.0      21685
7.0      13118
8.0      11779
10.0     10976
9.0       9399
13.0      8662
11.0      8165
14.0      8142
12.0      8009
15.0      6907
16.0      6379
17.0      5462
18.0      5008
20.0      3522
22.0      3463
21.0      3343
19.0      3285
23.0      3208
24.0      2950
25.0      2849
30.0      2616
27.0      2594
26.0      2549
29.0      2513
28.0      2433
32.0      2366
31.0      2287
33.0      2221
34.0      2037
36.0      1937
35.0      1757
39.0      1650
38.0      1607
44.0      1419
40.0      1348
37.0      1342
41.0      1324
42.0      1316
43.0      1315
45.0      1157
Name: VERDICHTUNGSRAUM, dtype: int64

ALTER_HH
0.28806712361502546
18.0    58688
17.0    53760
19.0    51093
16.0    50290
15.0    50265
14.0    42870
21.0    40527
20.0    39559
13.0    36522
12.0    33862
10.0    29538
11.0    27133
9.0     22178
8.0     13095
7.0      8197
6.0      3676
5.0       997
4.0       582
3.0       189
2.0        45
1.0         1
Name: ALTER_HH, dtype: int64

ALTERSKATEGORIE_FEIN
0.2806287528496053
15.0    61564
14.0    57972
16.0    51729
18.0    49468
17.0    48241
13.0    48113
12.0    41719
19.0    40907
10.0    33924
11.0    32175
20.0    26989
9.0     25499
8.0     14134
21.0    13242
7.0      8339
6.0      3619
22.0     3540
23.0     2707
24.0     2236
25.0      987
5.0       964
4.0       613
3.0       207
2.0        61
1.0         1
Name: ALTERSKATEGORIE_FEIN, dtype: int64

D19_VERSAND_ONLINE_QUOTE_12
0.22405136433349898
0.0     404587
10.0    180970
5.0       7785
8.0       6181
7.0       4764
9.0       3779
3.0       2581
6.0       1054
2.0        726
4.0        720
1.0        550
Name: D19_VERSAND_ONLINE_QUOTE_12, dtype: int64

D19_KONSUMTYP
0.22405136433349898
9.0    246671
1.0    113725
4.0     75647
6.0     54578
3.0     51751
2.0     47746
5.0     23579
Name: D19_KONSUMTYP, dtype: int64

D19_GESAMT_ONLINE_QUOTE_12
0.22405136433349898
0.0     381081
10.0    192861
5.0      10181
8.0       9121
7.0       6685
9.0       5817
3.0       3441
6.0       1638
2.0       1035
4.0        987
1.0        850
Name: D19_GESAMT_ONLINE_QUOTE_12, dtype: int64

D19_BANKEN_ONLINE_QUOTE_12
0.22405136433349898
0.0     569978
10.0     42586
5.0        380
3.0        211
7.0        210
8.0        169
9.0         67
6.0         45
2.0         33
4.0         16
1.0          2
Name: D19_BANKEN_ONLINE_QUOTE_12, dtype: int64

KBA13_ANTG3
0.17659397723350265
2.0    251723
1.0    220761
3.0    178747
Name: KBA13_ANTG3, dtype: int64

REGIOTYP
0.07751811546101335
6.0    194519
5.0    144818
3.0     93422
2.0     91169
7.0     83446
4.0     67808
1.0     54408
Name: REGIOTYP, dtype: int64

KKK
0.07751811546101335
3.0    271799
2.0    180538
4.0    177959
1.0     99294
Name: KKK, dtype: int64

VHN
0.07751811546101335
2.0    232818
3.0    178765
4.0    177388
1.0    140619
Name: VHN, dtype: int64

W_KEIT_KIND_HH
0.0741763486867476
6.0    279629
4.0    128078
3.0     97646
2.0     81563
1.0     80920
5.0     64397
Name: W_KEIT_KIND_HH, dtype: int64

KBA05_MAXBJ
0.060858592563652246
1.0    256802
4.0    187434
2.0    183281
3.0    115249
Name: KBA05_MAXBJ, dtype: int64

KBA05_KW3
0.060858592563652246
1.0    233425
0.0    206759
2.0    160700
3.0     80316
4.0     61566
Name: KBA05_KW3, dtype: int64

KBA05_ALTER3
0.060858592563652246
3.0    292308
2.0    158654
4.0    156133
1.0     68114
5.0     67557
Name: KBA05_ALTER3, dtype: int64

KBA05_ALTER2
0.060858592563652246
3.0    287985
2.0    165736
4.0    159841
5.0     72196
1.0     57008
Name: KBA05_ALTER2, dtype: int64

KBA05_MAXHERST
0.060858592563652246
2.0    270599
3.0    209371
4.0    116381
1.0     81622
5.0     64793
Name: KBA05_MAXHERST, dtype: int64

KBA05_MAXSEG
0.060858592563652246
2.0    299057
1.0    202755
3.0    171881
4.0     69073
Name: KBA05_MAXSEG, dtype: int64

KBA05_MAXVORB
0.060858592563652246
2.0    323169
3.0    240748
1.0    178849
Name: KBA05_MAXVORB, dtype: int64

KBA05_MOD1
0.060858592563652246
0.0    285955
2.0    180675
1.0    140803
3.0     87687
4.0     47646
Name: KBA05_MOD1, dtype: int64

KBA05_MOD2
0.060858592563652246
3.0    301068
2.0    160924
4.0    157386
1.0     65663
5.0     57725
Name: KBA05_MOD2, dtype: int64

KBA05_MOD3
0.060858592563652246
3.0    276622
2.0    170323
4.0    165665
1.0     67881
5.0     62275
Name: KBA05_MOD3, dtype: int64

KBA05_MOD4
0.060858592563652246
3.0    223031
2.0    160031
4.0    130743
1.0     97828
5.0     80575
0.0     50558
Name: KBA05_MOD4, dtype: int64

KBA05_MOD8
0.060858592563652246
0.0    221780
1.0    217215
2.0    216569
3.0     87202
Name: KBA05_MOD8, dtype: int64

KBA05_MOTOR
0.060858592563652246
3.0    289701
2.0    222043
1.0    121046
4.0    109976
Name: KBA05_MOTOR, dtype: int64

KBA05_ALTER1
0.060858592563652246
2.0    228541
1.0    166962
3.0    166048
0.0    102732
4.0     78483
Name: KBA05_ALTER1, dtype: int64

KBA05_SEG1
0.060858592563652246
1.0    251066
0.0    246274
2.0    185847
3.0     59579
Name: KBA05_SEG1, dtype: int64

KBA05_SEG10
0.060858592563652246
2.0    267628
1.0    151466
3.0    148587
0.0    111725
4.0     63360
Name: KBA05_SEG10, dtype: int64

KBA05_SEG2
0.060858592563652246
3.0    300292
4.0    164183
2.0    152391
1.0     69353
5.0     56547
Name: KBA05_SEG2, dtype: int64

KBA05_SEG3
0.060858592563652246
3.0    271144
2.0    184311
4.0    163914
1.0     62336
5.0     61061
Name: KBA05_SEG3, dtype: int64

KBA05_SEG4
0.060858592563652246
3.0    322840
2.0    152452
4.0    143596
1.0     62144
5.0     61734
Name: KBA05_SEG4, dtype: int64

KBA05_SEG5
0.060858592563652246
1.0    234996
2.0    183346
0.0    182735
3.0     90969
4.0     50720
Name: KBA05_SEG5, dtype: int64

KBA05_SEG6
0.060858592563652246
0.0    654341
1.0     88425
Name: KBA05_SEG6, dtype: int64

KBA05_SEG7
0.060858592563652246
0.0    367933
1.0    183792
2.0    141092
3.0     49949
Name: KBA05_SEG7, dtype: int64

KBA05_SEG8
0.060858592563652246
0.0    403677
1.0    173698
2.0    120168
3.0     45223
Name: KBA05_SEG8, dtype: int64

KBA05_SEG9
0.060858592563652246
0.0    257566
1.0    240633
2.0    188020
3.0     56547
Name: KBA05_SEG9, dtype: int64

KBA05_VORB0
0.060858592563652246
3.0    243681
2.0    173117
4.0    162090
1.0    107025
5.0     56853
Name: KBA05_VORB0, dtype: int64

KBA05_ZUL4
0.060858592563652246
2.0    183050
1.0    174845
3.0    125233
0.0    105526
4.0    100306
5.0     53806
Name: KBA05_ZUL4, dtype: int64

KBA05_ZUL3
0.060858592563652246
3.0    224489
2.0    160610
4.0    156483
1.0     73840
0.0     71244
5.0     56100
Name: KBA05_ZUL3, dtype: int64

KBA05_ZUL2
0.060858592563652246
3.0    288479
2.0    166353
4.0    159801
1.0     64701
5.0     63432
Name: KBA05_ZUL2, dtype: int64

KBA05_ZUL1
0.060858592563652246
3.0    299251
4.0    158207
2.0    156789
1.0     67616
5.0     60903
Name: KBA05_ZUL1, dtype: int64

KBA05_MAXAH
0.060858592563652246
3.0    209046
5.0    194964
2.0    185611
4.0    102158
1.0     50987
Name: KBA05_MAXAH, dtype: int64

KBA05_VORB1
0.060858592563652246
3.0    310047
2.0    153256
4.0    148641
5.0     65594
1.0     65228
Name: KBA05_VORB1, dtype: int64

KBA05_KW2
0.060858592563652246
3.0    306109
2.0    155105
4.0    152004
5.0     64903
1.0     64645
Name: KBA05_KW2, dtype: int64

KBA05_KRSHERST1
0.060858592563652246
3.0    298972
2.0    174704
4.0    161686
1.0     63273
5.0     44131
Name: KBA05_KRSHERST1, dtype: int64

KBA05_CCM1
0.060858592563652246
3.0    289864
2.0    170775
4.0    148697
1.0     67730
5.0     65700
Name: KBA05_CCM1, dtype: int64

KBA05_CCM2
0.060858592563652246
3.0    300935
4.0    163286
2.0    157741
1.0     62096
5.0     58708
Name: KBA05_CCM2, dtype: int64

KBA05_CCM3
0.060858592563652246
3.0    285815
4.0    166252
2.0    153991
5.0     70461
1.0     66247
Name: KBA05_CCM3, dtype: int64

KBA05_CCM4
0.060858592563652246
0.0    273958
1.0    214595
2.0    128371
3.0     78581
4.0     47261
Name: KBA05_CCM4, dtype: int64

KBA05_DIESEL
0.060858592563652246
2.0    294486
3.0    163588
1.0    155395
4.0     64725
0.0     64572
Name: KBA05_DIESEL, dtype: int64

KBA05_FRAU
0.060858592563652246
3.0    303085
2.0    153823
4.0    146650
5.0     70077
1.0     69131
Name: KBA05_FRAU, dtype: int64

KBA05_HERST1
0.060858592563652246
2.0    225597
3.0    177055
1.0    118736
4.0     87464
0.0     75536
5.0     58378
Name: KBA05_HERST1, dtype: int64

KBA05_HERST2
0.060858592563652246
3.0    301810
2.0    172895
4.0    151955
5.0     60659
1.0     55447
Name: KBA05_HERST2, dtype: int64

KBA05_HERST3
0.060858592563652246
3.0    298279
2.0    159402
4.0    147216
5.0     60806
1.0     60295
0.0     16768
Name: KBA05_HERST3, dtype: int64

KBA05_HERST4
0.060858592563652246
3.0    259420
2.0    164282
4.0    142585
1.0     75096
5.0     70665
0.0     30718
Name: KBA05_HERST4, dtype: int64

KBA05_VORB2
0.060858592563652246
3.0    234648
2.0    160670
4.0    120185
5.0     88437
1.0     84499
0.0     54327
Name: KBA05_VORB2, dtype: int64

KBA05_KRSAQUOT
0.060858592563652246
3.0    283410
2.0    151321
4.0    143613
1.0     83913
5.0     80509
Name: KBA05_KRSAQUOT, dtype: int64

KBA05_HERST5
0.060858592563652246
3.0    242052
2.0    164252
4.0    159185
5.0     65075
1.0     64932
0.0     47270
Name: KBA05_HERST5, dtype: int64

KBA05_KRSVAN
0.060858592563652246
2.0    491818
1.0    125887
3.0    125061
Name: KBA05_KRSVAN, dtype: int64

KBA05_ALTER4
0.060858592563652246
3.0    298921
4.0    144014
2.0    138550
1.0     56803
5.0     54385
0.0     50093
Name: KBA05_ALTER4, dtype: int64

KBA05_KRSHERST2
0.060858592563652246
3.0    297773
2.0    159934
4.0    152348
1.0     71716
5.0     60995
Name: KBA05_KRSHERST2, dtype: int64

KBA05_KW1
0.060858592563652246
3.0    274730
4.0    160460
2.0    160141
1.0     78228
5.0     69207
Name: KBA05_KW1, dtype: int64

KBA05_KRSHERST3
0.060858592563652246
3.0    293164
2.0    154527
4.0    147653
5.0     82408
1.0     65014
Name: KBA05_KRSHERST3, dtype: int64

KBA05_KRSKLEIN
0.060858592563652246
2.0    436203
1.0    156545
3.0    150018
Name: KBA05_KRSKLEIN, dtype: int64

KBA05_KRSOBER
0.060858592563652246
2.0    464297
1.0    151976
3.0    126493
Name: KBA05_KRSOBER, dtype: int64

KBA05_KRSZUL
0.060858592563652246
2.0    379931
1.0    208444
3.0    154391
Name: KBA05_KRSZUL, dtype: int64

KBA05_ANHANG
0.05973834838582423
1.0    323310
0.0    266012
3.0     81488
2.0     72842
Name: KBA05_ANHANG, dtype: int64

KBA05_MOTRAD
0.05894178649865533
1.0    391869
0.0    204179
2.0     74208
3.0     74026
Name: KBA05_MOTRAD, dtype: int64

SHOPPER_TYP
0.04586426332565852
1.0    244539
2.0    205062
3.0    178260
0.0    126764
Name: SHOPPER_TYP, dtype: int64

HEALTH_TYP
0.04586426332565852
3.0    306325
2.0    293712
1.0    154588
Name: HEALTH_TYP, dtype: int64

VERS_TYP
0.04586426332565852
2.0    391943
1.0    362682
Name: VERS_TYP, dtype: int64

KBA05_GBZ
0.04310284878347299
3.0    197714
5.0    158794
4.0    155219
2.0    138187
1.0    106895
Name: KBA05_GBZ, dtype: int64

KBA05_ANTG1
0.04310284878347299
0.0    260515
1.0    160807
2.0    126623
3.0    117740
4.0     91124
Name: KBA05_ANTG1, dtype: int64

KBA05_AUTOQUOT
0.04310284878347299
3.0    257907
4.0    194626
2.0    123266
1.0     84088
5.0     82874
9.0     14048
Name: KBA05_AUTOQUOT, dtype: int64

KBA05_ANTG2
0.04310284878347299
0.0    291882
1.0    163613
2.0    138177
3.0    134271
4.0     28866
Name: KBA05_ANTG2, dtype: int64

MOBI_REGIO
0.04310284878347299
1.0    163324
3.0    150206
5.0    148635
4.0    148119
2.0    146189
6.0       336
Name: MOBI_REGIO, dtype: int64

KBA05_ANTG3
0.04310284878347299
0.0    510792
1.0     92574
2.0     80104
3.0     73339
Name: KBA05_ANTG3, dtype: int64

KBA05_ANTG4
0.04310284878347299
0.0    599781
1.0     83062
2.0     73966
Name: KBA05_ANTG4, dtype: int64

NATIONALITAET_KZ
0.04235939102211534
1.0    661978
2.0     63165
3.0     32254
Name: NATIONALITAET_KZ, dtype: int64

HH_DELTA_FLAG
0.041680416842099936
0.0    687283
1.0     70651
Name: HH_DELTA_FLAG, dtype: int64

RT_UEBERGROESSE
0.03579344518073736
5.0    175541
4.0    157033
3.0    145649
2.0    142794
1.0    141573
Name: RT_UEBERGROESSE, dtype: int64

PRAEGENDE_JUGENDJAHRE
0.035001940829360007
14.0    181205
8.0     140378
10.0     85145
5.0      84311
3.0      53543
15.0     41929
11.0     35357
9.0      33418
6.0      25605
12.0     24302
1.0      20467
4.0      20402
2.0       7471
13.0      5686
7.0       3997
Name: PRAEGENDE_JUGENDJAHRE, dtype: int64

PLZ8_HHZ
0.020694172075069003
3.0    309085
4.0    211876
5.0    175780
2.0     66862
1.0     10929
Name: PLZ8_HHZ, dtype: int64

PLZ8_ANTG1
0.020694172075069003
2.0    270523
3.0    222305
1.0    189214
4.0     87021
0.0      5469
Name: PLZ8_ANTG1, dtype: int64

PLZ8_ANTG2
0.020694172075069003
3.0    307236
2.0    215695
4.0    190979
1.0     53188
0.0      7434
Name: PLZ8_ANTG2, dtype: int64

PLZ8_ANTG3
0.020694172075069003
2.0    252957
1.0    237825
3.0    164015
0.0    119735
Name: PLZ8_ANTG3, dtype: int64

PLZ8_ANTG4
0.020694172075069003
0.0    356284
1.0    294942
2.0    123306
Name: PLZ8_ANTG4, dtype: int64

PLZ8_BAUMAX
0.020694172075069003
1.0    499423
5.0     97316
2.0     70394
4.0     56671
3.0     50728
Name: PLZ8_BAUMAX, dtype: int64

PLZ8_GBZ
0.020694172075069003
3.0    288336
4.0    180220
5.0    153844
2.0    111539
1.0     40593
Name: PLZ8_GBZ, dtype: int64

KBA13_ANTG2
0.020546239153166206
3.0    325207
2.0    207993
4.0    182916
1.0     58533
Name: KBA13_ANTG2, dtype: int64

KBA05_HERSTTEMP
0.020298419899380325
3.0    275320
1.0    162304
2.0    157786
4.0    119956
5.0     59479
Name: KBA05_HERSTTEMP, dtype: int64

KBA13_ANTG1
0.017051481921206122
2.0    299448
3.0    219174
1.0    200723
4.0     58068
Name: KBA13_ANTG1, dtype: int64

KBA13_FIAT
0.006926295266525814
3.0    343347
4.0    174024
2.0    148334
5.0     78722
1.0     40994
Name: KBA13_FIAT, dtype: int64

KBA13_FORD
0.006926295266525814
3.0    335170
2.0    162145
4.0    154870
5.0     69233
1.0     64003
Name: KBA13_FORD, dtype: int64

KBA13_GBZ
0.006926295266525814
3.0    284563
4.0    184003
5.0    167473
2.0    109422
1.0     39960
Name: KBA13_GBZ, dtype: int64

KBA13_HALTER_65
0.006926295266525814
3.0    331364
4.0    175040
2.0    140351
5.0     85579
1.0     53087
Name: KBA13_HALTER_65, dtype: int64

KBA13_HALTER_25
0.006926295266525814
3.0    341430
2.0    165111
4.0    144771
1.0     72751
5.0     61358
Name: KBA13_HALTER_25, dtype: int64

KBA13_HALTER_30
0.006926295266525814
3.0    322185
2.0    155663
4.0    150541
5.0     90957
1.0     66075
Name: KBA13_HALTER_30, dtype: int64

KBA13_HALTER_35
0.006926295266525814
3.0    309769
4.0    160396
2.0    151032
5.0    100377
1.0     63847
Name: KBA13_HALTER_35, dtype: int64

KBA13_HALTER_40
0.006926295266525814
3.0    313672
4.0    159800
2.0    151984
5.0     95692
1.0     64273
Name: KBA13_HALTER_40, dtype: int64

KBA13_HALTER_45
0.006926295266525814
3.0    318028
4.0    160040
2.0    158198
5.0     79488
1.0     69667
Name: KBA13_HALTER_45, dtype: int64

KBA13_FAB_SONSTIGE
0.006926295266525814
3.0    345124
2.0    167481
4.0    153466
5.0     61004
1.0     58346
Name: KBA13_FAB_SONSTIGE, dtype: int64

KBA13_HALTER_50
0.006926295266525814
3.0    325071
2.0    183530
4.0    133612
1.0     89591
5.0     53617
Name: KBA13_HALTER_50, dtype: int64

KBA13_HALTER_55
0.006926295266525814
3.0    319411
2.0    183685
4.0    135541
1.0     92743
5.0     54041
Name: KBA13_HALTER_55, dtype: int64

KBA13_HALTER_60
0.006926295266525814
3.0    321266
2.0    172974
4.0    140907
1.0     88762
5.0     61512
Name: KBA13_HALTER_60, dtype: int64

KBA13_HERST_EUROPA
0.006926295266525814
3.0    341097
4.0    170642
2.0    151037
5.0     72872
1.0     49773
Name: KBA13_HERST_EUROPA, dtype: int64

KBA13_HALTER_66
0.006926295266525814
3.0    320451
4.0    175161
2.0    139386
5.0     86577
1.0     63846
Name: KBA13_HALTER_66, dtype: int64

KBA13_HERST_ASIEN
0.006926295266525814
3.0    338074
2.0    162979
4.0    155084
5.0     67474
1.0     61810
Name: KBA13_HERST_ASIEN, dtype: int64

KBA13_HERST_AUDI_VW
0.006926295266525814
3.0    336178
2.0    172160
4.0    149322
1.0     72901
5.0     54860
Name: KBA13_HERST_AUDI_VW, dtype: int64

KBA13_HERST_BMW_BENZ
0.006926295266525814
3.0    339754
4.0    180052
2.0    133074
5.0     86958
1.0     45583
Name: KBA13_HERST_BMW_BENZ, dtype: int64

KBA13_HERST_FORD_OPEL
0.006926295266525814
3.0    326805
2.0    164003
4.0    154044
1.0     74276
5.0     66293
Name: KBA13_HERST_FORD_OPEL, dtype: int64

KBA13_HERST_SONST
0.006926295266525814
3.0    345124
2.0    167481
4.0    153466
5.0     61004
1.0     58346
Name: KBA13_HERST_SONST, dtype: int64

KBA13_HHZ
0.006926295266525814
3.0    319964
4.0    212119
5.0    168143
2.0     72085
1.0     13110
Name: KBA13_HHZ, dtype: int64

KBA13_CCM_3001
0.006926295266525814
1.0    338439
4.0    215760
3.0    147448
5.0     83682
2.0        92
Name: KBA13_CCM_3001, dtype: int64

KBA13_KMH_180
0.006926295266525814
3.0    355363
2.0    170291
4.0    155595
1.0     61574
5.0     42598
Name: KBA13_KMH_180, dtype: int64

KBA13_KMH_0_140
0.006926295266525814
3.0    283566
1.0    234424
0.0     96010
4.0     92670
5.0     72077
2.0      6674
Name: KBA13_KMH_0_140, dtype: int64

KBA13_KMH_110
0.006926295266525814
1.0    627623
3.0     94175
2.0     63623
Name: KBA13_KMH_110, dtype: int64

KBA13_KMH_140
0.006926295266525814
1.0    249773
4.0    202067
3.0    167221
2.0     91648
5.0     74712
Name: KBA13_KMH_140, dtype: int64

KBA13_KMH_140_210
0.006926295266525814
3.0    361405
2.0    179354
4.0    133192
1.0     73767
5.0     37703
Name: KBA13_KMH_140_210, dtype: int64

KBA13_FAB_ASIEN
0.006926295266525814
3.0    340870
2.0    169019
4.0    152278
1.0     62422
5.0     60832
Name: KBA13_FAB_ASIEN, dtype: int64

KBA13_HALTER_20
0.006926295266525814
3.0    338233
2.0    184872
4.0    146416
1.0     66025
5.0     49875
Name: KBA13_HALTER_20, dtype: int64

KBA13_CCM_3000
0.006926295266525814
3.0    308099
1.0    149271
2.0    103136
4.0     92197
5.0     76156
0.0     56562
Name: KBA13_CCM_3000, dtype: int64

KBA13_BJ_2006
0.006926295266525814
3.0    356794
2.0    166960
4.0    160793
1.0     52194
5.0     48680
Name: KBA13_BJ_2006, dtype: int64

KBA13_KMH_211
0.006926295266525814
3.0    277452
2.0    162264
0.0    139463
4.0     88043
5.0     75823
1.0     42376
Name: KBA13_KMH_211, dtype: int64

KBA13_ALTERHALTER_30
0.006926295266525814
3.0    333405
2.0    160653
4.0    147128
1.0     72911
5.0     71324
Name: KBA13_ALTERHALTER_30, dtype: int64

KBA13_ALTERHALTER_45
0.006926295266525814
3.0    305775
4.0    161597
2.0    150705
5.0     97478
1.0     69866
Name: KBA13_ALTERHALTER_45, dtype: int64

KBA13_ALTERHALTER_60
0.006926295266525814
3.0    321522
2.0    188053
4.0    130673
1.0     93825
5.0     51348
Name: KBA13_ALTERHALTER_60, dtype: int64

KBA13_ALTERHALTER_61
0.006926295266525814
3.0    323096
4.0    177428
2.0    138065
5.0     87118
1.0     59714
Name: KBA13_ALTERHALTER_61, dtype: int64

KBA13_ANZAHL_PKW
0.006926295266525814
1400.0    11722
1500.0     8291
1300.0     6427
1600.0     6135
1700.0     3795
1800.0     2617
464.0      1604
417.0      1604
519.0      1600
534.0      1496
386.0      1458
1900.0     1450
395.0      1446
481.0      1417
455.0      1409
483.0      1393
452.0      1388
418.0      1384
454.0      1380
450.0      1380
494.0      1379
459.0      1379
492.0      1359
504.0      1340
387.0      1338
420.0      1337
439.0      1327
506.0      1326
388.0      1324
456.0      1323
          ...  
28.0         24
27.0         24
25.0         23
24.0         22
26.0         21
18.0         21
17.0         20
20.0         18
21.0         17
22.0         16
12.0         16
14.0         16
29.0         15
15.0         14
23.0         13
30.0         12
16.0         11
19.0         11
13.0         10
1.0           8
10.0          8
11.0          7
5.0           7
9.0           7
4.0           7
3.0           6
8.0           6
2.0           6
7.0           5
6.0           5
Name: KBA13_ANZAHL_PKW, Length: 1261, dtype: int64

KBA13_AUDI
0.006926295266525814
3.0    346606
2.0    162879
4.0    158506
5.0     60877
1.0     56553
Name: KBA13_AUDI, dtype: int64

KBA13_AUTOQUOTE
0.006926295266525814
3.0    322478
2.0    186121
4.0    130528
1.0    102004
5.0     44288
0.0         2
Name: KBA13_AUTOQUOTE, dtype: int64

KBA13_BAUMAX
0.006926295266525814
1.0    491118
5.0    115476
2.0     69249
3.0     59060
4.0     50518
Name: KBA13_BAUMAX, dtype: int64

KBA13_BJ_1999
0.006926295266525814
3.0    359190
2.0    166296
4.0    159513
1.0     51058
5.0     49364
Name: KBA13_BJ_1999, dtype: int64

KBA13_BJ_2000
0.006926295266525814
3.0    347385
2.0    164179
4.0    159567
1.0     57775
5.0     56515
Name: KBA13_BJ_2000, dtype: int64

KBA13_BJ_2004
0.006926295266525814
3.0    364930
2.0    166228
4.0    157705
1.0     49522
5.0     47036
Name: KBA13_BJ_2004, dtype: int64

KBA13_BJ_2008
0.006926295266525814
3.0    274885
2.0    170564
0.0    134372
4.0     86070
5.0     69750
1.0     49780
Name: KBA13_BJ_2008, dtype: int64

KBA13_CCM_2501
0.006926295266525814
3.0    294765
1.0    121797
2.0    106344
0.0     93403
4.0     91223
5.0     77889
Name: KBA13_CCM_2501, dtype: int64

KBA13_BJ_2009
0.006926295266525814
3.0    286198
1.0    119909
2.0    118034
0.0    101115
4.0     88293
5.0     71872
Name: KBA13_BJ_2009, dtype: int64

KBA13_BMW
0.006926295266525814
3.0    346193
4.0    176871
2.0    139903
5.0     83249
1.0     39205
Name: KBA13_BMW, dtype: int64

KBA13_CCM_0_1400
0.006926295266525814
3.0    268319
2.0    178378
0.0    138711
4.0     81885
1.0     60025
5.0     58103
Name: KBA13_CCM_0_1400, dtype: int64

KBA13_CCM_1000
0.006926295266525814
3.0    290316
1.0    120040
2.0    119210
0.0    103227
4.0     86726
5.0     65902
Name: KBA13_CCM_1000, dtype: int64

KBA13_CCM_1200
0.006926295266525814
3.0    278653
2.0    161690
0.0    145802
4.0     81631
1.0     61072
5.0     56573
Name: KBA13_CCM_1200, dtype: int64

KBA13_CCM_1400
0.006926295266525814
3.0    362764
2.0    169640
4.0    161205
5.0     49030
1.0     42782
Name: KBA13_CCM_1400, dtype: int64

KBA13_CCM_1401_2500
0.006926295266525814
3.0    359093
2.0    174525
4.0    157962
1.0     60428
5.0     33413
Name: KBA13_CCM_1401_2500, dtype: int64

KBA13_CCM_1500
0.006926295266525814
1.0    287731
4.0    206213
3.0    156747
5.0     68326
2.0     66404
Name: KBA13_CCM_1500, dtype: int64

KBA13_CCM_1600
0.006926295266525814
3.0    364222
2.0    167171
4.0    163429
5.0     51951
1.0     38648
Name: KBA13_CCM_1600, dtype: int64

KBA13_CCM_1800
0.006926295266525814
3.0    276578
2.0    179863
0.0    137534
4.0     81550
5.0     57795
1.0     52101
Name: KBA13_CCM_1800, dtype: int64

KBA13_CCM_2000
0.006926295266525814
3.0    363131
4.0    170342
2.0    160922
5.0     57234
1.0     33792
Name: KBA13_CCM_2000, dtype: int64

KBA13_CCM_2500
0.006926295266525814
3.0    283932
2.0    144530
1.0    102470
0.0     95350
4.0     88885
5.0     70254
Name: KBA13_CCM_2500, dtype: int64

KBA13_KMH_210
0.006926295266525814
3.0    361259
4.0    164843
2.0    161113
5.0     55495
1.0     42711
Name: KBA13_KMH_210, dtype: int64

KBA13_KRSSEG_VAN
0.006926295266525814
2.0    487626
1.0    169327
3.0    127767
0.0       701
Name: KBA13_KRSSEG_VAN, dtype: int64

KBA13_KMH_250
0.006926295266525814
3.0    278290
2.0    161652
0.0    139756
4.0     88055
5.0     75224
1.0     42444
Name: KBA13_KMH_250, dtype: int64

KBA13_PEUGEOT
0.006926295266525814
3.0    340805
4.0    170378
2.0    154373
5.0     70008
1.0     49857
Name: KBA13_PEUGEOT, dtype: int64

KBA13_SEG_GELAENDEWAGEN
0.006926295266525814
3.0    345193
2.0    178645
4.0    143742
1.0     67786
5.0     50055
Name: KBA13_SEG_GELAENDEWAGEN, dtype: int64

KBA13_SEG_GROSSRAUMVANS
0.006926295266525814
3.0    340642
4.0    169944
2.0    152717
5.0     71749
1.0     50369
Name: KBA13_SEG_GROSSRAUMVANS, dtype: int64

KBA13_SEG_KLEINST
0.006926295266525814
3.0    337591
2.0    161954
4.0    158570
1.0     64938
5.0     62368
Name: KBA13_SEG_KLEINST, dtype: int64

KBA13_SEG_KLEINWAGEN
0.006926295266525814
3.0    341514
2.0    167806
4.0    152870
1.0     68565
5.0     54666
Name: KBA13_SEG_KLEINWAGEN, dtype: int64

KBA13_SEG_KOMPAKTKLASSE
0.006926295266525814
3.0    344398
2.0    173314
4.0    142952
1.0     64602
5.0     60155
Name: KBA13_SEG_KOMPAKTKLASSE, dtype: int64

KBA13_SEG_MINIVANS
0.006926295266525814
3.0    341862
2.0    161436
4.0    160044
5.0     65130
1.0     56949
Name: KBA13_SEG_MINIVANS, dtype: int64

KBA13_SEG_MINIWAGEN
0.006926295266525814
3.0    339598
4.0    176093
2.0    146739
5.0     77150
1.0     45841
Name: KBA13_SEG_MINIWAGEN, dtype: int64

KBA13_SEG_MITTELKLASSE
0.006926295266525814
3.0    337241
4.0    164192
2.0    156862
5.0     73287
1.0     53839
Name: KBA13_SEG_MITTELKLASSE, dtype: int64

KBA13_SEG_OBEREMITTELKLASSE
0.006926295266525814
3.0    342284
4.0    184285
2.0    132852
5.0     81830
1.0     44170
Name: KBA13_SEG_OBEREMITTELKLASSE, dtype: int64

KBA13_SEG_OBERKLASSE
0.006926295266525814
3.0    283488
1.0    157682
4.0     91369
0.0     86270
5.0     84648
2.0     81964
Name: KBA13_SEG_OBERKLASSE, dtype: int64

KBA13_SEG_SONSTIGE
0.006926295266525814
3.0    352268
2.0    167674
4.0    165534
5.0     64535
1.0     35410
Name: KBA13_SEG_SONSTIGE, dtype: int64

KBA13_SEG_SPORTWAGEN
0.006926295266525814
3.0    267922
2.0    146930
1.0    106267
4.0     92168
5.0     88713
0.0     83421
Name: KBA13_SEG_SPORTWAGEN, dtype: int64

KBA13_SEG_UTILITIES
0.006926295266525814
3.0    346456
2.0    164012
4.0    160080
5.0     61579
1.0     53294
Name: KBA13_SEG_UTILITIES, dtype: int64

KBA13_SEG_WOHNMOBILE
0.006926295266525814
3.0    269140
2.0    165205
1.0     95326
4.0     88952
0.0     85796
5.0     81002
Name: KBA13_SEG_WOHNMOBILE, dtype: int64

KBA13_SITZE_4
0.006926295266525814
3.0    328454
4.0    181695
2.0    129303
5.0     93443
1.0     52526
Name: KBA13_SITZE_4, dtype: int64

KBA13_SITZE_5
0.006926295266525814
3.0    330228
2.0    179693
4.0    128787
1.0     91495
5.0     55218
Name: KBA13_SITZE_5, dtype: int64

KBA13_SITZE_6
0.006926295266525814
3.0    336224
4.0    166528
2.0    147292
5.0     76974
1.0     58403
Name: KBA13_SITZE_6, dtype: int64

KBA13_TOYOTA
0.006926295266525814
3.0    343193
4.0    166426
2.0    156050
5.0     72002
1.0     47750
Name: KBA13_TOYOTA, dtype: int64

KBA13_VORB_0
0.006926295266525814
3.0    349837
4.0    174276
2.0    153752
5.0     71730
1.0     35826
Name: KBA13_VORB_0, dtype: int64

KBA13_VORB_1
0.006926295266525814
3.0    361449
2.0    167076
4.0    158150
1.0     50939
5.0     47807
Name: KBA13_VORB_1, dtype: int64

KBA13_VORB_1_2
0.006926295266525814
3.0    359262
2.0    173047
4.0    151120
1.0     61834
5.0     40158
Name: KBA13_VORB_1_2, dtype: int64

KBA13_VORB_2
0.006926295266525814
3.0    363866
2.0    166317
4.0    162515
5.0     49491
1.0     43232
Name: KBA13_VORB_2, dtype: int64

KBA13_VORB_3
0.006926295266525814
3.0    264685
2.0    177458
0.0    143284
4.0     81176
5.0     64131
1.0     54687
Name: KBA13_VORB_3, dtype: int64

KBA13_VW
0.006926295266525814
3.0    336588
2.0    171175
4.0    149018
1.0     71506
5.0     57134
Name: KBA13_VW, dtype: int64

KBA13_KMH_251
0.006926295266525814
1.0    674722
3.0    100548
2.0     10151
Name: KBA13_KMH_251, dtype: int64

KBA13_RENAULT
0.006926295266525814
3.0    336384
4.0    163307
2.0    160781
5.0     70384
1.0     54565
Name: KBA13_RENAULT, dtype: int64

KBA13_SEG_VAN
0.006926295266525814
3.0    341438
4.0    165710
2.0    157763
5.0     67515
1.0     52995
Name: KBA13_SEG_VAN, dtype: int64

KBA13_OPEL
0.006926295266525814
3.0    327618
2.0    164244
4.0    154681
1.0     72559
5.0     66319
Name: KBA13_OPEL, dtype: int64

KBA13_KW_30
0.006926295266525814
1.0    554887
2.0    142986
3.0     87548
Name: KBA13_KW_30, dtype: int64

KBA13_KRSAQUOT
0.006926295266525814
3.0    324094
2.0    172621
4.0    142200
1.0     91812
5.0     54634
0.0        60
Name: KBA13_KRSAQUOT, dtype: int64

KBA13_NISSAN
0.006926295266525814
3.0    335457
4.0    167124
2.0    160213
5.0     71427
1.0     51200
Name: KBA13_NISSAN, dtype: int64

KBA13_KRSHERST_BMW_BENZ
0.006926295266525814
3.0    345264
4.0    165044
2.0    153245
5.0     74315
1.0     47495
0.0        58
Name: KBA13_KRSHERST_BMW_BENZ, dtype: int64

KBA13_KRSHERST_FORD_OPEL
0.006926295266525814
3.0    329240
4.0    166035
2.0    158597
1.0     65801
5.0     65690
0.0        58
Name: KBA13_KRSHERST_FORD_OPEL, dtype: int64

KBA13_KRSSEG_KLEIN
0.006926295266525814
2.0    718366
1.0     35557
3.0     31418
0.0        80
Name: KBA13_KRSSEG_KLEIN, dtype: int64

KBA13_KRSSEG_OBER
0.006926295266525814
2.0    516676
1.0    151665
3.0    116737
0.0       343
Name: KBA13_KRSSEG_OBER, dtype: int64

KBA13_KRSZUL_NEU
0.006926295266525814
2.0    378304
1.0    222157
3.0    153623
0.0     31337
Name: KBA13_KRSZUL_NEU, dtype: int64

KBA13_KW_0_60
0.006926295266525814
3.0    357321
2.0    165557
4.0    159703
1.0     54326
5.0     48514
Name: KBA13_KW_0_60, dtype: int64

KBA13_KW_110
0.006926295266525814
3.0    275679
2.0    175418
0.0    124216
4.0     83780
1.0     63943
5.0     62385
Name: KBA13_KW_110, dtype: int64

KBA13_KW_120
0.006926295266525814
3.0    281415
1.0    226709
4.0     94774
0.0     85028
5.0     73739
2.0     23756
Name: KBA13_KW_120, dtype: int64

KBA13_KW_121
0.006926295266525814
3.0    282845
2.0    135033
1.0    105706
0.0     95195
4.0     88983
5.0     77659
Name: KBA13_KW_121, dtype: int64

KBA13_KRSHERST_AUDI_VW
0.006926295266525814
3.0    337010
2.0    167559
4.0    162124
1.0     65798
5.0     52872
0.0        58
Name: KBA13_KRSHERST_AUDI_VW, dtype: int64

KBA13_MERCEDES
0.006926295266525814
3.0    340379
4.0    178995
2.0    134513
5.0     82948
1.0     48586
Name: KBA13_MERCEDES, dtype: int64

KBA13_KW_80
0.006926295266525814
3.0    269221
2.0    180524
0.0    129268
4.0     77562
1.0     77214
5.0     51632
Name: KBA13_KW_80, dtype: int64

KBA13_KW_40
0.006926295266525814
3.0    283084
2.0    135456
1.0    121060
0.0     99260
4.0     85040
5.0     61521
Name: KBA13_KW_40, dtype: int64

KBA13_MAZDA
0.006926295266525814
3.0    343060
4.0    169148
2.0    156989
5.0     71832
1.0     44392
Name: KBA13_MAZDA, dtype: int64

KBA13_KW_90
0.006926295266525814
3.0    277407
2.0    181685
0.0    133326
4.0     82747
5.0     58683
1.0     51573
Name: KBA13_KW_90, dtype: int64

KBA13_MOTOR
0.006926295266525814
3.0    474886
2.0    144655
4.0    102786
1.0     63094
Name: KBA13_MOTOR, dtype: int64

KBA13_KW_70
0.006926295266525814
3.0    276717
2.0    184387
0.0    141548
4.0     78915
5.0     54143
1.0     49711
Name: KBA13_KW_70, dtype: int64

KBA13_KW_61_120
0.006926295266525814
3.0    360881
2.0    164699
4.0    161381
5.0     49311
1.0     49149
Name: KBA13_KW_61_120, dtype: int64

KBA13_KW_60
0.006926295266525814
3.0    267832
2.0    178524
0.0    132144
1.0     79264
4.0     78418
5.0     49239
Name: KBA13_KW_60, dtype: int64

KBA13_KW_50
0.006926295266525814
3.0    273920
2.0    181545
0.0    143215
4.0     81007
5.0     55954
1.0     49780
Name: KBA13_KW_50, dtype: int64

CJT_TYP_3
0.0059274319476949645
5.0    210898
2.0    169832
3.0    158455
4.0    150250
1.0     96776
Name: CJT_TYP_3, dtype: int64

ONLINE_AFFINITAET
0.0059274319476949645
4.0    153466
3.0    152379
1.0    147129
2.0    142482
5.0    128511
0.0     62244
Name: ONLINE_AFFINITAET, dtype: int64

CJT_TYP_2
0.0059274319476949645
2.0    192233
5.0    174571
3.0    161408
4.0    143254
1.0    114745
Name: CJT_TYP_2, dtype: int64

CJT_TYP_1
0.0059274319476949645
5.0    206190
2.0    184857
3.0    160113
4.0    152607
1.0     82444
Name: CJT_TYP_1, dtype: int64

CJT_KATALOGNUTZER
0.0059274319476949645
5.0    219354
4.0    162788
1.0    158125
3.0    146420
2.0     99524
Name: CJT_KATALOGNUTZER, dtype: int64

CJT_GESAMTTYP
0.0059274319476949645
4.0    196312
3.0    145773
2.0    140526
5.0    110409
6.0    100858
1.0     92333
Name: CJT_GESAMTTYP, dtype: int64

RT_SCHNAEPPCHEN
0.0059274319476949645
5.0    331237
4.0    170386
3.0    125685
2.0    108665
1.0     50238
Name: RT_SCHNAEPPCHEN, dtype: int64

RT_KEIN_ANREIZ
0.0059274319476949645
5.0    199009
3.0    174094
4.0    150461
2.0    131430
1.0    131217
Name: RT_KEIN_ANREIZ, dtype: int64

RETOURTYP_BK_S
0.0059274319476949645
5.0    279838
3.0    172606
4.0    121862
1.0    121857
2.0     90048
Name: RETOURTYP_BK_S, dtype: int64

LP_LEBENSPHASE_GROB
0.0059274319476949645
2.0     148686
1.0     130441
3.0     107479
12.0     69149
4.0      50904
5.0      46411
9.0      45801
0.0      42096
10.0     37931
11.0     30879
8.0      28275
6.0      27157
7.0      21002
Name: LP_LEBENSPHASE_GROB, dtype: int64

CJT_TYP_4
0.0059274319476949645
5.0    196489
3.0    170528
2.0    167255
4.0    158108
1.0     93831
Name: CJT_TYP_4, dtype: int64

LP_STATUS_GROB
0.0059274319476949645
1.0    316293
2.0    169482
4.0    151937
5.0    110749
3.0     37750
Name: LP_STATUS_GROB, dtype: int64

LP_STATUS_FEIN
0.0059274319476949645
1.0     205654
9.0     134539
10.0    110749
2.0     110639
4.0      73713
3.0      68616
6.0      28619
5.0      27153
8.0      17398
7.0       9131
Name: LP_STATUS_FEIN, dtype: int64

LP_LEBENSPHASE_FEIN
0.0059274319476949645
1.0     58419
5.0     52231
0.0     44912
6.0     42915
2.0     36997
8.0     28653
7.0     24887
29.0    24780
11.0    24730
13.0    24661
10.0    23742
31.0    22219
12.0    21829
30.0    21021
15.0    18901
3.0     18588
19.0    18283
37.0    17341
4.0     16437
20.0    16224
14.0    16209
32.0    15712
39.0    15099
40.0    14095
16.0    13567
27.0    13510
38.0    12995
35.0    12916
34.0    12262
21.0    11931
9.0     11928
28.0    11423
24.0    11261
25.0     9741
36.0     9619
23.0     8434
22.0     6792
18.0     6470
33.0     5701
17.0     5434
26.0     3342
Name: LP_LEBENSPHASE_FEIN, dtype: int64

LP_FAMILIE_GROB
0.0059274319476949645
1.0    398655
5.0    187174
2.0     97487
4.0     49277
3.0     27157
0.0     26461
Name: LP_FAMILIE_GROB, dtype: int64

LP_FAMILIE_FEIN
0.0059274319476949645
1.0     398655
10.0    128210
2.0      97487
11.0     48569
0.0      26461
8.0      21642
7.0      19341
4.0      11471
5.0      11107
9.0      10395
6.0       8294
3.0       4579
Name: LP_FAMILIE_FEIN, dtype: int64

CJT_TYP_6
0.0059274319476949645
5.0    212823
4.0    181232
2.0    163088
3.0    158469
1.0     70599
Name: CJT_TYP_6, dtype: int64

CJT_TYP_5
0.0059274319476949645
5.0    210520
3.0    182983
2.0    164079
4.0    136255
1.0     92374
Name: CJT_TYP_5, dtype: int64

GFK_URLAUBERTYP
0.0059274319476949645
12.0    128231
10.0    102289
8.0      82420
11.0     74545
5.0      70266
4.0      60175
9.0      56606
3.0      53023
1.0      50485
2.0      42246
7.0      40303
6.0      25622
Name: GFK_URLAUBERTYP, dtype: int64

CAMEO_DEUG_2015
0.0054280002882795405
8.0    133973
9.0    107569
6.0    105359
4.0    103144
3.0     85809
2.0     82616
7.0     77413
5.0     54720
1.0     36003
Name: CAMEO_DEUG_2015, dtype: int64

CAMEO_INTL_2015
0.0054280002882795405
51.0    133210
41.0     91839
24.0     90578
14.0     62442
43.0     56406
54.0     45136
25.0     39382
22.0     32863
23.0     26130
13.0     26122
45.0     26007
55.0     23665
52.0     20500
31.0     18714
34.0     18390
15.0     16901
44.0     14730
12.0     13154
35.0     10308
32.0     10288
33.0      9841
Name: CAMEO_INTL_2015, dtype: int64

STRUKTURTYP
0.005178916650545771
3.0    551184
1.0    126135
2.0    109484
Name: STRUKTURTYP, dtype: int64

ORTSGR_KLS9
0.005105582381568317
5.0    146822
4.0    113780
7.0    102054
9.0     91071
3.0     82630
6.0     75485
8.0     72300
2.0     62577
1.0     40084
0.0        58
Name: ORTSGR_KLS9, dtype: int64

RELAT_AB
0.005105582381568317
3.0    271665
5.0    173767
1.0    141103
2.0    103796
4.0     96376
9.0       154
Name: RELAT_AB, dtype: int64

ARBEIT
0.005105582381568317
4.0    309181
3.0    252686
2.0    134011
1.0     55930
5.0     34899
9.0       154
Name: ARBEIT, dtype: int64

UMFELD_JUNG
0.004969028915196504
5.0    346843
4.0    224711
3.0    129667
2.0     53070
1.0     32678
Name: UMFELD_JUNG, dtype: int64

UMFELD_ALT
0.004969028915196504
4.0    226880
3.0    207742
5.0    132998
2.0    120493
1.0     98856
Name: UMFELD_ALT, dtype: int64

ANZ_HH_TITEL
0.004722474045358509
0.0     763384
1.0      20011
2.0       2426
3.0        579
4.0        231
5.0        117
6.0        105
8.0         68
7.0         64
9.0         34
13.0        29
12.0        22
11.0        22
14.0        16
10.0        16
17.0        13
20.0         9
15.0         7
18.0         6
23.0         3
16.0         2
Name: ANZ_HH_TITEL, dtype: int64

VK_ZG11
0.0029472789825249496
10.0    95045
5.0     94368
6.0     85498
7.0     85483
4.0     83562
8.0     81139
9.0     78982
3.0     67733
2.0     58296
1.0     50583
11.0     7879
Name: VK_ZG11, dtype: int64

VK_DISTANZ
0.0029472789825249496
10.0    91357
8.0     88609
9.0     86009
7.0     81103
6.0     80579
11.0    77509
3.0     69135
12.0    58010
1.0     43692
5.0     37566
4.0     28934
13.0    26762
2.0     19303
Name: VK_DISTANZ, dtype: int64

VK_DHT4A
0.0029472789825249496
10.0    110783
7.0      93698
9.0      85398
8.0      83482
3.0      79793
6.0      77295
2.0      71798
5.0      67984
4.0      67669
1.0      48214
11.0      2454
Name: VK_DHT4A, dtype: int64

INNENSTADT
0.0007270207700351119
5.0    146230
4.0    132908
6.0    110539
2.0    108241
3.0     92133
8.0     81930
7.0     66798
1.0     51545
Name: INNENSTADT, dtype: int64

BALLRAUM
0.0007270207700351119
6.0    252741
1.0    150622
2.0    103705
7.0     98044
3.0     72587
4.0     60728
5.0     51897
Name: BALLRAUM, dtype: int64

EWDICHTE
0.0007270207700351119
6.0    199608
5.0    159943
2.0    137658
4.0    129555
1.0     83009
3.0     80551
Name: EWDICHTE, dtype: int64

KONSUMNAEHE
7.333426897745477e-05
1.0    186953
3.0    165405
5.0    149202
2.0    130347
4.0    129145
6.0     25716
7.0      4073
Name: KONSUMNAEHE, dtype: int64

ANZ_STATISTISCHE_HAUSHALTE
4.9310973967598894e-05
1.0      214998
2.0      120713
3.0       61101
4.0       44592
5.0       39903
6.0       38298
7.0       36449
8.0       32729
9.0       28394
10.0      24012
11.0      19393
12.0      16249
13.0      13010
14.0      10576
15.0       8906
16.0       7287
17.0       6114
18.0       5194
19.0       4446
20.0       3807
21.0       3407
22.0       3199
23.0       2624
24.0       2468
25.0       2362
26.0       2031
27.0       1970
28.0       1913
29.0       1692
30.0       1649
          ...  
328.0         5
228.0         5
242.0         5
303.0         5
177.0         5
216.0         4
209.0         4
241.0         4
218.0         4
239.0         4
203.0         4
262.0         4
198.0         4
309.0         4
284.0         4
371.0         3
229.0         3
182.0         3
449.0         3
289.0         3
245.0         3
248.0         3
189.0         2
197.0         2
227.0         2
336.0         2
175.0         2
165.0         2
133.0         1
314.0         1
Name: ANZ_STATISTISCHE_HAUSHALTE, Length: 266, dtype: int64

GEBAEUDETYP_RASTER
6.321919739435756e-06
4.0    356851
3.0    203591
5.0    157264
2.0     58418
1.0     14770
Name: GEBAEUDETYP_RASTER, dtype: int64

FIRMENDICHTE
6.321919739435756e-06
4.0    271304
3.0    180177
5.0    157264
2.0    138036
1.0     44113
Name: FIRMENDICHTE, dtype: int64

KONSUMZELLE
6.321919739435756e-06
0.0    603929
1.0    186965
Name: KONSUMZELLE, dtype: int64

D19_TELKO_OFFLINE_DATUM
0.0
10    721077
9      35533
8      18052
5       6125
6       3821
7       3471
4       1133
1        657
2        530
3        500
Name: D19_TELKO_OFFLINE_DATUM, dtype: int64

ANZ_KINDER
0.0
0.0     707286
1.0      53462
2.0      23655
3.0       5224
4.0       1031
5.0        182
6.0         44
7.0         10
9.0          3
11.0         1
8.0          1
Name: ANZ_KINDER, dtype: int64

D19_VERSI_ONLINE_DATUM
0.0
10    783761
9       2658
8       1214
7       1008
5        955
6        664
4        282
2        133
3        124
1        100
Name: D19_VERSI_ONLINE_DATUM, dtype: int64

DSL_FLAG
0.0
1.0    770429
0.0     20470
Name: DSL_FLAG, dtype: int64

EINGEZOGENAM_HH_JAHR
0.0
1994.0    109493
1997.0     64321
2004.0     43940
2015.0     43203
2014.0     41506
2001.0     40287
2008.0     33962
2005.0     33674
2002.0     33646
2012.0     32347
2000.0     31314
1999.0     31060
2007.0     30144
2013.0     29244
1998.0     28699
2011.0     25100
2009.0     22160
1996.0     21429
2003.0     21080
2006.0     20705
2010.0     20576
1995.0     17692
2016.0     12736
1993.0       852
1992.0       549
2018.0       505
1991.0       200
2017.0       189
1990.0       157
1989.0        72
1988.0        27
1987.0        19
1986.0         7
1984.0         1
1971.0         1
1904.0         1
1900.0         1
Name: EINGEZOGENAM_HH_JAHR, dtype: int64

FINANZ_ANLEGER
0.0
1    206589
2    156817
5    153824
4    137014
3    136655
Name: FINANZ_ANLEGER, dtype: int64

ANZ_TITEL
0.0
0.0    787822
1.0      2876
2.0       194
3.0         5
4.0         2
Name: ANZ_TITEL, dtype: int64

ANZ_PERSONEN
0.0
1.0     408666
2.0     188927
3.0      92280
4.0      45880
0.0      32965
5.0      15121
6.0       4723
7.0       1491
8.0        513
9.0        175
10.0        65
11.0        37
12.0        16
13.0        11
14.0         4
21.0         4
15.0         3
20.0         3
22.0         2
38.0         2
37.0         2
23.0         2
17.0         1
40.0         1
18.0         1
45.0         1
16.0         1
35.0         1
31.0         1
Name: ANZ_PERSONEN, dtype: int64

FINANZ_HAUSBAUER
0.0
5    183270
2    164915
3    157009
4    156086
1    129619
Name: FINANZ_HAUSBAUER, dtype: int64

D19_TELKO_ONLINE_DATUM
0.0
10    783020
9       4468
8       1655
7        546
5        483
6        443
4        112
1         65
3         63
2         44
Name: D19_TELKO_ONLINE_DATUM, dtype: int64

ANZ_HAUSHALTE_AKTIV
0.0
1.0      192383
2.0      120197
3.0       62172
4.0       42934
5.0       37553
6.0       35798
7.0       34335
8.0       32150
9.0       28877
10.0      25316
11.0      21895
12.0      17964
13.0      15208
14.0      12575
15.0      10340
16.0       8854
17.0       7257
0.0        6300
18.0       6284
19.0       5433
20.0       4653
21.0       4108
22.0       3720
23.0       3220
24.0       2809
25.0       2613
26.0       2320
27.0       2211
28.0       2026
29.0       1949
          ...  
285.0         4
515.0         4
523.0         4
301.0         4
249.0         4
174.0         4
266.0         4
256.0         4
255.0         4
250.0         4
260.0         4
331.0         4
226.0         3
224.0         3
168.0         3
307.0         3
414.0         3
244.0         3
378.0         3
293.0         3
272.0         3
395.0         3
237.0         2
254.0         2
404.0         2
213.0         2
366.0         1
536.0         1
232.0         1
220.0         1
Name: ANZ_HAUSHALTE_AKTIV, Length: 292, dtype: int64

ALTER_KIND4
0.0
0.0     789730
17.0       218
18.0       209
15.0       166
16.0       156
14.0       132
13.0       116
12.0        57
11.0        46
10.0        39
9.0         15
8.0         14
7.0          1
Name: ALTER_KIND4, dtype: int64

ALTER_KIND3
0.0
0.0     784898
18.0       846
15.0       828
16.0       818
17.0       800
14.0       725
13.0       653
12.0       422
11.0       358
10.0       229
9.0        154
8.0         99
7.0         39
6.0         21
5.0          7
4.0          2
Name: ALTER_KIND3, dtype: int64

FINANZ_MINIMALIST
0.0
3    179155
5    159771
4    157280
2    156945
1    137748
Name: FINANZ_MINIMALIST, dtype: int64

FINANZ_UNAUFFAELLIGER
0.0
1    219921
2    182993
3    159060
5    116678
4    112247
Name: FINANZ_UNAUFFAELLIGER, dtype: int64

ALTER_KIND2
0.0
0.0     762309
18.0      3047
14.0      3011
17.0      3008
15.0      2993
16.0      2920
13.0      2894
12.0      2538
11.0      2368
10.0      1882
9.0       1579
8.0       1141
7.0        599
6.0        384
5.0        148
4.0         63
3.0         12
2.0          3
Name: ALTER_KIND2, dtype: int64

ALTER_KIND1
0.0
0.0     712499
18.0      6512
17.0      6213
8.0       6163
7.0       6024
16.0      5946
15.0      5830
14.0      5798
9.0       5660
13.0      5568
10.0      5495
12.0      5381
11.0      5333
6.0       4688
5.0       1430
4.0       1012
3.0        978
2.0        369
Name: ALTER_KIND1, dtype: int64

FINANZ_VORSORGER
0.0
5    234262
4    191124
3    151823
2    111611
1    102079
Name: FINANZ_VORSORGER, dtype: int64

D19_VERSI_OFFLINE_DATUM
0.0
10    758879
9      17426
8       6816
5       3270
7       1858
6       1544
4        530
2        200
3        198
1        178
Name: D19_VERSI_OFFLINE_DATUM, dtype: int64

D19_VERSI_DATUM
0.0
10    567825
9      77563
8      35108
5      28120
6      22414
7      20335
2      15389
4       9274
1       8479
3       6392
Name: D19_VERSI_DATUM, dtype: int64

D19_VERSI_ANZ_24
0.0
0    680496
1     61230
2     35904
3      8560
4      3932
5       687
6        90
Name: D19_VERSI_ANZ_24, dtype: int64

D19_GESAMT_OFFLINE_DATUM
0.0
10    468794
9     142955
8      67274
5      33684
7      26805
6      21857
4       9270
2       8878
1       6117
3       5265
Name: D19_GESAMT_OFFLINE_DATUM, dtype: int64

D19_VERSAND_ANZ_12
0.0
0    546456
1     93329
2     78812
3     33003
4     28297
5      9373
6      1629
Name: D19_VERSAND_ANZ_12, dtype: int64

D19_TELKO_DATUM
0.0
10    572940
9     114099
8      41009
5      18849
7      17564
6      13141
4       5135
1       2978
2       2733
3       2451
Name: D19_TELKO_DATUM, dtype: int64

D19_VERSAND_ANZ_24
0.0
0    474777
2     90474
1     87383
4     53041
3     46087
5     29260
6      9877
Name: D19_VERSAND_ANZ_24, dtype: int64

D19_VERSAND_DATUM
0.0
10    352545
9      97867
5      75959
1      52009
8      48805
2      39465
6      35888
4      32989
7      31016
3      24356
Name: D19_VERSAND_DATUM, dtype: int64

FINANZ_SPARER
0.0
1    242314
2    147434
5    143150
3    137372
4    120629
Name: FINANZ_SPARER, dtype: int64

D19_TELKO_ANZ_24
0.0
0    728048
1     44959
2     14841
3      1987
4       819
5       192
6        53
Name: D19_TELKO_ANZ_24, dtype: int64

D19_TELKO_ANZ_12
0.0
0    758753
1     24071
2      6707
3       837
4       397
5       100
6        34
Name: D19_TELKO_ANZ_12, dtype: int64

D19_KONSUMTYP_MAX
0.0
8    252488
9    177202
1    139532
2     88619
4     73092
3     59966
Name: D19_KONSUMTYP_MAX, dtype: int64

D19_GESAMT_ONLINE_DATUM
0.0
10    365426
9      83500
5      77134
1      55309
8      41300
2      39843
6      35720
4      34965
7      32320
3      25382
Name: D19_GESAMT_ONLINE_DATUM, dtype: int64

D19_GESAMT_DATUM
0.0
10    271448
9      95369
5      90633
1      73351
2      57691
8      51255
4      42549
6      40585
7      36298
3      31720
Name: D19_GESAMT_DATUM, dtype: int64

D19_VERSI_ANZ_12
0.0
0    723245
1     43434
2     19628
3      3255
4      1163
5       164
6        10
Name: D19_VERSI_ANZ_12, dtype: int64

D19_GESAMT_ANZ_24
0.0
0    418072
2     98427
1     83772
4     71586
3     56504
5     44885
6     17653
Name: D19_GESAMT_ANZ_24, dtype: int64

D19_GESAMT_ANZ_12
0.0
0    494955
1     96189
2     93980
3     44073
4     42013
5     16363
6      3326
Name: D19_GESAMT_ANZ_12, dtype: int64

D19_BANKEN_ONLINE_DATUM
0.0
10    632269
9      63873
8      22173
5      21415
7      15705
6      13160
1       6689
4       6636
2       4778
3       4201
Name: D19_BANKEN_ONLINE_DATUM, dtype: int64

D19_BANKEN_OFFLINE_DATUM
0.0
10    771814
8       6282
9       5095
5       4073
2       1985
6        497
1        460
7        323
4        303
3         67
Name: D19_BANKEN_OFFLINE_DATUM, dtype: int64

D19_BANKEN_DATUM
0.0
10    585200
9      79930
8      31979
5      28543
7      19752
6      16544
1       8205
4       8121
2       7718
3       4907
Name: D19_BANKEN_DATUM, dtype: int64

D19_BANKEN_ANZ_24
0.0
0    697109
1     42093
2     28121
3      9831
4      8721
5      3771
6      1253
Name: D19_BANKEN_ANZ_24, dtype: int64

D19_BANKEN_ANZ_12
0.0
0    733405
1     28784
2     17479
3      5511
4      3928
5      1435
6       357
Name: D19_BANKEN_ANZ_12, dtype: int64

D19_VERSAND_OFFLINE_DATUM
0.0
10    542007
9     120274
8      56148
5      21974
7      19143
6      14947
4       5217
2       4676
1       3354
3       3159
Name: D19_VERSAND_OFFLINE_DATUM, dtype: int64

D19_VERSAND_ONLINE_DATUM
0.0
10    407501
9      80081
5      68395
1      48028
8      37126
2      35601
6      32975
4      30428
7      28435
3      22329
Name: D19_VERSAND_ONLINE_DATUM, dtype: int64

FINANZTYP
0.0
6    288329
1    195439
5    105442
2    103661
4     55573
3     42455
Name: FINANZTYP, dtype: int64

EINGEFUEGT_AM_DAY
0.0
9092    383034
9090    191957
7999     11176
4034      6207
4793      6048
8707      3195
4279      2339
6080      2323
4645      2286
9081      2065
4149      2051
8368      2025
7747      2016
7754      1930
8503      1818
8502      1481
8676      1298
8675      1292
8055      1169
7605      1149
7746      1139
7837      1113
4000      1096
4282      1093
4794      1051
8473      1043
7542      1020
7823      1004
8501      1002
7810       987
         ...  
3378         1
1843         1
6967         1
3893         1
4407         1
2349         1
2347         1
1818         1
2341         1
4377         1
2842         1
3866         1
6942         1
5916         1
3873         1
3363         1
2627         1
7968         1
1601         1
4395         1
6439         1
1829         1
6437         1
3367         1
4163         1
1830         1
3879         1
8483         1
1599         1
4098         1
Name: EINGEFUEGT_AM_DAY, Length: 4473, dtype: int64

GEBAEUDETYP
0.0
1.0    455317
3.0    177200
8.0    152081
2.0      4797
4.0       887
6.0       616
5.0         1
Name: GEBAEUDETYP, dtype: int64

D19_LETZTER_KAUF_BRANCHE_D19_BEKLEIDUNG_GEH
0.0
0    780989
1      9910
Name: D19_LETZTER_KAUF_BRANCHE_D19_BEKLEIDUNG_GEH, dtype: int64

D19_LETZTER_KAUF_BRANCHE_D19_BILDUNG
0.0
0    789957
1       942
Name: D19_LETZTER_KAUF_BRANCHE_D19_BILDUNG, dtype: int64

D19_LETZTER_KAUF_BRANCHE_D19_BIO_OEKO
0.0
0    789721
1      1178
Name: D19_LETZTER_KAUF_BRANCHE_D19_BIO_OEKO, dtype: int64

D19_LETZTER_KAUF_BRANCHE_D19_BUCH_CD
0.0
0    763046
1     27853
Name: D19_LETZTER_KAUF_BRANCHE_D19_BUCH_CD, dtype: int64

D19_LETZTER_KAUF_BRANCHE_D19_DIGIT_SERV
0.0
0    787434
1      3465
Name: D19_LETZTER_KAUF_BRANCHE_D19_DIGIT_SERV, dtype: int64

D19_LETZTER_KAUF_BRANCHE_D19_DROGERIEARTIKEL
0.0
0    767670
1     23229
Name: D19_LETZTER_KAUF_BRANCHE_D19_DROGERIEARTIKEL, dtype: int64

D19_LETZTER_KAUF_BRANCHE_D19_ENERGIE
0.0
0    779174
1     11725
Name: D19_LETZTER_KAUF_BRANCHE_D19_ENERGIE, dtype: int64

D19_LETZTER_KAUF_BRANCHE_D19_FREIZEIT
0.0
0    783883
1      7016
Name: D19_LETZTER_KAUF_BRANCHE_D19_FREIZEIT, dtype: int64

D19_LETZTER_KAUF_BRANCHE_D19_GARTEN
0.0
0    789314
1      1585
Name: D19_LETZTER_KAUF_BRANCHE_D19_GARTEN, dtype: int64

D19_LETZTER_KAUF_BRANCHE_D19_HANDWERK
0.0
0    788745
1      2154
Name: D19_LETZTER_KAUF_BRANCHE_D19_HANDWERK, dtype: int64

D19_LETZTER_KAUF_BRANCHE_D19_HAUS_DEKO
0.0
0    770717
1     20182
Name: D19_LETZTER_KAUF_BRANCHE_D19_HAUS_DEKO, dtype: int64

D19_LETZTER_KAUF_BRANCHE_D19_KINDERARTIKEL
0.0
0    783890
1      7009
Name: D19_LETZTER_KAUF_BRANCHE_D19_KINDERARTIKEL, dtype: int64

D19_LETZTER_KAUF_BRANCHE_D19_KOSMETIK
0.0
0    790133
1       766
Name: D19_LETZTER_KAUF_BRANCHE_D19_KOSMETIK, dtype: int64

D19_LETZTER_KAUF_BRANCHE_D19_LEBENSMITTEL
0.0
0    784652
1      6247
Name: D19_LETZTER_KAUF_BRANCHE_D19_LEBENSMITTEL, dtype: int64

D19_LETZTER_KAUF_BRANCHE_D19_LOTTO
0.0
0    790096
1       803
Name: D19_LETZTER_KAUF_BRANCHE_D19_LOTTO, dtype: int64

D19_LETZTER_KAUF_BRANCHE_D19_NAHRUNGSERGAENZUNG
0.0
0    786947
1      3952
Name: D19_LETZTER_KAUF_BRANCHE_D19_NAHRUNGSERGAENZUNG, dtype: int64

D19_LETZTER_KAUF_BRANCHE_D19_RATGEBER
0.0
0    786126
1      4773
Name: D19_LETZTER_KAUF_BRANCHE_D19_RATGEBER, dtype: int64

D19_LETZTER_KAUF_BRANCHE_D19_REISEN
0.0
0    787880
1      3019
Name: D19_LETZTER_KAUF_BRANCHE_D19_REISEN, dtype: int64

D19_LETZTER_KAUF_BRANCHE_D19_SAMMELARTIKEL
0.0
0    788537
1      2362
Name: D19_LETZTER_KAUF_BRANCHE_D19_SAMMELARTIKEL, dtype: int64

D19_LETZTER_KAUF_BRANCHE_D19_SCHUHE
0.0
0    759525
1     31374
Name: D19_LETZTER_KAUF_BRANCHE_D19_SCHUHE, dtype: int64

D19_LETZTER_KAUF_BRANCHE_D19_SONSTIGE
0.0
0    747633
1     43266
Name: D19_LETZTER_KAUF_BRANCHE_D19_SONSTIGE, dtype: int64

D19_LETZTER_KAUF_BRANCHE_D19_TECHNIK
0.0
0    784130
1      6769
Name: D19_LETZTER_KAUF_BRANCHE_D19_TECHNIK, dtype: int64

D19_LETZTER_KAUF_BRANCHE_D19_TELKO_MOBILE
0.0
0    776933
1     13966
Name: D19_LETZTER_KAUF_BRANCHE_D19_TELKO_MOBILE, dtype: int64

D19_LETZTER_KAUF_BRANCHE_D19_TELKO_REST
0.0
0    779783
1     11116
Name: D19_LETZTER_KAUF_BRANCHE_D19_TELKO_REST, dtype: int64

D19_LETZTER_KAUF_BRANCHE_D19_TIERARTIKEL
0.0
0    788407
1      2492
Name: D19_LETZTER_KAUF_BRANCHE_D19_TIERARTIKEL, dtype: int64

D19_LETZTER_KAUF_BRANCHE_D19_UNBEKANNT
0.0
0    601284
1    189615
Name: D19_LETZTER_KAUF_BRANCHE_D19_UNBEKANNT, dtype: int64

D19_LETZTER_KAUF_BRANCHE_D19_VERSAND_REST
0.0
0    765728
1     25171
Name: D19_LETZTER_KAUF_BRANCHE_D19_VERSAND_REST, dtype: int64

D19_LETZTER_KAUF_BRANCHE_D19_VERSICHERUNGEN
0.0
0    735009
1     55890
Name: D19_LETZTER_KAUF_BRANCHE_D19_VERSICHERUNGEN, dtype: int64

D19_LETZTER_KAUF_BRANCHE_D19_VOLLSORTIMENT
0.0
0    757274
1     33625
Name: D19_LETZTER_KAUF_BRANCHE_D19_VOLLSORTIMENT, dtype: int64

D19_LETZTER_KAUF_BRANCHE_D19_WEIN_FEINKOST
0.0
0    788799
1      2100
Name: D19_LETZTER_KAUF_BRANCHE_D19_WEIN_FEINKOST, dtype: int64

D19_LETZTER_KAUF_BRANCHE_D19_BEKLEIDUNG_REST
0.0
0    769925
1     20974
Name: D19_LETZTER_KAUF_BRANCHE_D19_BEKLEIDUNG_REST, dtype: int64

D19_LETZTER_KAUF_BRANCHE_D19_BANKEN_REST
0.0
0    785828
1      5071
Name: D19_LETZTER_KAUF_BRANCHE_D19_BANKEN_REST, dtype: int64

GREEN_AVANTGARDE
0.0
0    617034
1    173865
Name: GREEN_AVANTGARDE, dtype: int64

D19_LETZTER_KAUF_BRANCHE_D19_BANKEN_LOKAL
0.0
0    789502
1      1397
Name: D19_LETZTER_KAUF_BRANCHE_D19_BANKEN_LOKAL, dtype: int64

HH_EINKOMMEN_SCORE
0.0
6.0    252121
5.0    200535
4.0    137979
3.0     82523
2.0     64739
1.0     53002
Name: HH_EINKOMMEN_SCORE, dtype: int64

KBA05_MODTEMP
0.0
3.0    267003
4.0    226175
1.0    151572
2.0     77548
5.0     59479
6.0      9122
Name: KBA05_MODTEMP, dtype: int64

MIN_GEBAEUDEJAHR_ENG
0.0
25.0    567772
23.0     78545
24.0     25434
22.0     25389
21.0     16551
20.0     14383
17.0      7252
26.0      5805
16.0      5776
12.0      5276
27.0      4396
18.0      4356
15.0      4130
19.0      4057
14.0      3288
13.0      2875
28.0      2043
9.0       2037
10.0      2008
8.0       1855
11.0      1843
6.0       1581
7.0       1248
5.0       1237
29.0      1025
30.0       468
31.0       125
32.0       101
4.0         43
Name: MIN_GEBAEUDEJAHR_ENG, dtype: int64

KOMBIALTER
0.0
4    264472
3    238353
2    177180
1     91554
9     19340
Name: KOMBIALTER, dtype: int64

MOBI_RASTER
0.0
1.0    353127
3.0    123116
2.0    117391
4.0     91992
5.0     80251
6.0     25022
Name: MOBI_RASTER, dtype: int64

OST_WEST_KZ
0.0
1    623637
0    167262
Name: OST_WEST_KZ, dtype: int64

SEMIO_DOM
0.0
5    177420
7    160885
4    124091
6     97252
3     94024
2     92770
1     44457
Name: SEMIO_DOM, dtype: int64

SEMIO_ERL
0.0
4    188654
7    174707
6    134815
3    103311
2     76861
5     73514
1     39037
Name: SEMIO_ERL, dtype: int64

SEMIO_FAM
0.0
2    139068
4    132558
5    131168
7    114449
6    104917
3     94494
1     74245
Name: SEMIO_FAM, dtype: int64

SEMIO_KAEM
0.0
3    176447
7    129373
5    127218
6    126200
2    108681
4     76154
1     46826
Name: SEMIO_KAEM, dtype: int64

SEMIO_KRIT
0.0
5    152344
4    143328
7    132689
6    132652
3    125358
2     53886
1     50642
Name: SEMIO_KRIT, dtype: int64

SEMIO_KULT
0.0
5    171378
3    130095
1    121277
7    113347
6    101193
4     97445
2     56164
Name: SEMIO_KULT, dtype: int64

SEMIO_LUST
0.0
6    158035
7    157293
2    105323
1    104870
4     94205
5     89159
3     82014
Name: SEMIO_LUST, dtype: int64

SEMIO_MAT
0.0
4    155970
2    130218
3    123259
7    107353
1     97057
5     94966
6     82076
Name: SEMIO_MAT, dtype: int64

SEMIO_PFLICHT
0.0
4    145861
3    133314
5    121126
7    115272
6    109362
2     91989
1     73975
Name: SEMIO_PFLICHT, dtype: int64

SEMIO_RAT
0.0
4    249265
2    140021
3    131668
7     86654
5     83547
6     56414
1     43330
Name: SEMIO_RAT, dtype: int64

SEMIO_REL
0.0
4    198755
3    150689
7    134890
1    103745
5     74149
2     69660
6     59011
Name: SEMIO_REL, dtype: int64

SEMIO_SOZ
0.0
2    160949
6    136053
5    121379
7    113347
3    110822
4     86510
1     61839
Name: SEMIO_SOZ, dtype: int64

SEMIO_TRADV
0.0
4    169007
3    147749
2    132087
5    113347
1     90115
7     73514
6     65080
Name: SEMIO_TRADV, dtype: int64

SEMIO_VERT
0.0
2    203460
6    141490
5    135020
7    122558
4    112770
1     43808
3     31793
Name: SEMIO_VERT, dtype: int64

SOHO_KZ
0.0
0.0    784250
1.0      6649
Name: SOHO_KZ, dtype: int64

UNGLEICHENN_FLAG
0.0
0.0    719931
1.0     70968
Name: UNGLEICHENN_FLAG, dtype: int64

WOHNDAUER_2008
0.0
9.0    536608
8.0     77667
4.0     47993
3.0     36353
6.0     33702
5.0     29053
7.0     23173
2.0      5739
1.0       611
Name: WOHNDAUER_2008, dtype: int64

WOHNLAGE
0.0
3.0    248153
7.0    168476
4.0    135097
2.0     99783
5.0     73787
1.0     43629
8.0     16594
0.0      5380
Name: WOHNLAGE, dtype: int64

ZABEOTYP
0.0
3    279720
4    206088
1    121965
5     80445
6     70501
2     32180
Name: ZABEOTYP, dtype: int64

ANREDE_KZ
0.0
2    412439
1    378460
Name: ANREDE_KZ, dtype: int64

ALTERSKATEGORIE_GROB
0.0
3    307696
4    221439
2    135823
1    123169
9      2772
Name: ALTERSKATEGORIE_GROB, dtype: int64

D19_LETZTER_KAUF_BRANCHE_D19_BANKEN_DIREKT
0.0
0    768390
1     22509
Name: D19_LETZTER_KAUF_BRANCHE_D19_BANKEN_DIREKT, dtype: int64

D19_LETZTER_KAUF_BRANCHE_D19_BANKEN_GROSS
0.0
0    780707
1     10192
Name: D19_LETZTER_KAUF_BRANCHE_D19_BANKEN_GROSS, dtype: int64

AKT_DAT_KL
0.0
1.0    378465
9.0    261904
5.0     28011
6.0     26635
3.0     23930
4.0     20579
7.0     20137
8.0     16730
2.0     14508
Name: AKT_DAT_KL, dtype: int64

Proportion of features that need imputation

In [78]:
features_missing.plot(kind="hist", bins=20, title="Histogram of Missing Values Proportions in Features after Cleaning");
plt.tight_layout()
plt.savefig("missing_after_cleaning.png")

We can see now that imputations won't be a big deal, and that there is very small number of features that exceed 10% missing values.

But how shall we impute that data? There are multiple ways we could that including:

  1. Imputing missing values with 0 or -1 which was used originally in the dataset
  2. Imputing missing values with mean, median or mode and adding and indicator (or not) for missing data
  3. KNN imputation
  4. Iterative imputation using linear regression

My gut feeling says that I should go with KNN imputation, as I think that features of the same category could be of huge aid in determining which value to impute with, instead of imputing with 0 or mode (since we are dealing with ordinal data).

But KNN imputation will be time and memory intensive, so we can't do on all features. So we probably should stick with an easy strategy like mode imputation for features below a threshold and KNN for features above.

But what that shall that threshold be?

Since KNN is memory and time intensive, we need the majority of features to be imputed using mode values, and only small number of features shall be imputed using KNN.

In [79]:
# Null percentages in features 
na_features = clean_azdias.isna().sum() / clean_azdias.shape[0]

# Filter out features that have no null values
na_features = na_features[na_features > 0]
In [80]:
na_features.plot(kind="box");
plt.title("Box Plot of Null Percetnages in Features");
plt.ylabel("Percetnage")
plt.tight_layout()
plt.savefig("missing_after_cleaning_box.png");

Using this box plot we can see that we can only impute features above 10% using KNN. So in order to do that we need to first impute features lower than 10% using mode imputation, then impute the rest of the features using KNN.

For that we shall use a ColumnTransformer to specify the mode imputation only to features below 10%, then have that followed by KNN imputer in the pipeline.

Let's test this separately first to ensure that it would work.

In [23]:
# Features below 10% null percentage
features_below_10 = na_features[na_features < 0.1].index

# Features indexes 
features_below_10_idx = np.argwhere(clean_azdias.columns.isin(features_below_10)).ravel()

# Most frequent imputer using ColumnTransformer
# mode_imputer = ColumnTransformer([
#     ('mode', SimpleImputer(strategy="most_frequent"), features_below_10_idx)
# ], remainder="passthrough")
In [24]:
%%time

# mode_azdias = mode_imputer.fit_transform(clean_azdias)
# mode_azdias = pd.DataFrame(mode_azdias, index=clean_azdias.index, columns=clean_azdias.columns)
CPU times: user 2 µs, sys: 2 µs, total: 4 µs
Wall time: 8.82 µs
In [25]:
# Features above 10% null percentage
features_above_10 = na_features[na_features > 0.1].index

# Features indexes 
features_above_10_idx = np.argwhere(clean_azdias.columns.isin(features_above_10)).ravel()

# Check number of features with missing value equals features above 10% 
# mode_azdias.isna().any(axis=0).sum() == len(features_above_10)
In [26]:
# Check dtype
# mode_azdias.info()
In [27]:
# Change dtype to float32
# mode_azdias = mode_azdias.astype(np.float32)
In [28]:
# %%time

# # Impute the rest of the features using KNN Imputer
# knn_imputer = KNNImputer()
# mode_azdias = knn_imputer.fit_transform(mode_azdias)
# mode_azdias = pd.DataFrame(mode_azdias, index=clean_azdias.index, columns=clean_azdias.columns)

After several trials, it seems that KNN takes a really long time and it's not memory efficient as we'll have to copy the whole dataset for imputations to the customer dataset.

So instead, I'll impute the remaining 10 features using mean imputation to avoid introducing any spike in the data since we are dealing with features that might have more than 20% missing values.

In [29]:
%%time

imputer = ColumnTransformer([
    ('mode', SimpleImputer(strategy="most_frequent"), features_below_10),
    ('mean', SimpleImputer(strategy="mean"), features_above_10)
], remainder="passthrough")


imputed_azdias = imputer.fit_transform(clean_azdias).astype(np.float32)
CPU times: user 11.9 s, sys: 32.2 s, total: 44.1 s
Wall time: 7min 33s

In order to compare features we need to re-arrange the columns of clean_azdias to be in the same way outputted by ColumnTransformer, which is the first feature list in the pipeline, followed by the second and then the remining features that weren't transformed.

In [30]:
# Imputed features as arranged in ColumnTransformer
imputed_features = list(features_below_10) + list(features_above_10)

# Features that weren't transformed
rem_features = [feat for feat in clean_azdias.columns if feat not in imputed_features]

# New arrangement of features which is outputted by ColumnTransformer
new_features = imputed_features + rem_features

# Convert imputed_azdias into DataFrame object for further analysis
imputed_azdias = pd.DataFrame(imputed_azdias, index=clean_azdias.index, columns=new_features)

Now let's inspect how each imputer affects the features

  1. First we need to find the null features and sort them in order to check how the imputation handled features with high null values.
  2. Then we need to visualize these features before and after imputation.
  3. If the feature has <= 10 unique values we'll visualize them using a bar plot, and we'll visualize the rest using histograms.
In [31]:
# Sort features using null percentages
na_features = na_features.sort_values(ascending=False)

# Loop over features
for feat, p in na_features.iteritems():
    fig, axes = plt.subplots(1, 2, sharey=True, figsize=(15, 4))
    
    # Feature before imputation (with imputing null values to 0 as they won't be visualized otherwise)
    null_feat = clean_azdias[feat].fillna(0)
    
    # Feature after imputation
    imputed_feat = imputed_azdias[feat]
    
    # Plot histogram if feature has more than 11 unique values
    if null_feat.nunique() > 11:
        null_feat.plot(kind="hist", bins=20, color="tab:blue", ax=axes[0])
    # Plot bar otherwise
    else:
        null_feat.value_counts().sort_index().plot(kind="bar", color="tab:blue", ax=axes[0])
    axes[0].set_title("{} - {:.2f}%".format(feat, p*100))
    
    if imputed_feat.nunique() > 11:
        null_feat.plot(kind="hist", bins=20, color="tab:blue", ax=axes[1])
    else:
        imputed_feat.value_counts().sort_index().plot(kind="bar", color="tab:blue", ax=axes[1])        
    axes[1].set_title(f"{feat} After Imputation")  
    
    plt.show()
In [32]:
# save imputed azdias in pickle format
pd.to_pickle(imputed_azdias, "imputed_azdias.pkl")

# save imputer in pickle format
pickle.dump(imputer, open("imputer.pkl", "wb"))

Now we can proceed to the next part

Part 1: Customer Segmentation Report

The main bulk of your analysis will come in this part of the project. Here, you should use unsupervised learning techniques to describe the relationship between the demographics of the company's existing customers and the general population of Germany. By the end of this part, you should be able to describe parts of the general population that are more likely to be part of the mail-order company's main customer base, and which parts of the general population are less so.

PCA and Cluster Anlaysis of AZDIAS

In [81]:
# load imputed azdias dataset
imputed_azdias = pd.read_pickle("imputed_azdias.pkl")
In [82]:
# load imputer
imputer = pickle.load(open("imputer.pkl", "rb"))
In [10]:
# scaler = StandardScaler()
# pca = IncrementalPCA()

Fitting StandardScaler to the data

In [11]:
# %%time 

# scaler.fit(imputed_azdias)

After trying to fit the scaler on the whole data, it turns out that it takes a really long time, and that we are better off fitting on batches of the dataset. So let's make a function that makes fitting any sklearn transformer using batches easy.

In [53]:
# scaler = batch_fit_scaler(scaler, imputed_azdias)
100%|██████████| 100/100 [00:05<00:00, 19.12it/s]

Fitting PCA to the data

In [54]:
# pca = batch_fit_pca(pca, scaler, imputed_azdias)
100%|██████████| 100/100 [04:32<00:00,  2.73s/it]

Save PCA and Scaler

In [ ]:
# pickle.dump(scaler, open("scaler.pkl", "wb"))
# pickle.dump(pca, open("pca.pkl", "wb"))

Load PCA and Scaler

In [83]:
scaler = pickle.load(open("scaler.pkl", "rb"))
pca = pickle.load(open("pca.pkl", "rb"))

Finding minimum dimensions explaining 95% of data variance

In [84]:
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1
print("Minimum number of dimensions that have 95% of original data variance:", d)
Minimum number of dimensions that have 95% of original data variance: 224

Transforming the data incrementaly as transforming the whole dataset at once takes a long time

In [85]:
# Transform and keep only d dimensions
pca_azdias = batch_transform_pca(pca, scaler, imputed_azdias)[:, :d]
100%|██████████| 100/100 [14:18<00:00, 42.05s/it]

MiniBatchKMeans Clustering

In [86]:
K = range(1, 10)
K_inertia = []

init = "k-means++"
n_init = 10
n_batches = 100
batch_size = pca_azdias.shape[0]//n_batches
seed = 42

for k in K:
    print(f"Running with k = {k}")
    kmeans = MiniBatchKMeans(n_clusters=k, init=init, n_init=n_init, batch_size=batch_size, random_state=seed)
    kmeans.fit(pca_azdias)
    K_inertia.append(kmeans.inertia_)
    print(f"Inertia: {kmeans.inertia_}\n")
Running with k = 1
Inertia: 265996686.15559685

Running with k = 2
Inertia: 250395368.02890292

Running with k = 3
Inertia: 245384193.8051987

Running with k = 4
Inertia: 239043764.56472105

Running with k = 5
Inertia: 235703097.15493846

Running with k = 6
Inertia: 233264450.79078335

Running with k = 7
Inertia: 230226030.7373761

Running with k = 8
Inertia: 228191643.24432597

Running with k = 9
Inertia: 226508461.52140844

Now that we know the intertia for each k clusters, we can plot inertia as a function of k and use the elbow method to determine k which shows significant improvement in inertia, and at which any increase in clusters shows marginal improvement.

In [87]:
plt.plot(K, K_inertia, linestyle='-', marker='o')
plt.title('Inertia as a Function of k Clusters')
plt.xlabel('k Clusters')
plt.ylabel('Inertia')
plt.xticks(K, K)
plt.tight_layout()
plt.savefig("kmean_inertia.png");

As we can see, the more clusters we set, the better the score we get. But this keeps going forever, so we need to find the elbow point, and I think that is 7.

In [88]:
k = 7
init = "k-means++"
n_init = 10
n_batches = 100
batch_size = pca_azdias.shape[0]//n_batches
seed = 42

kmeans = MiniBatchKMeans(n_clusters=k, init=init, n_init=n_init, batch_size=batch_size, random_state=seed)
kmeans.fit(pca_azdias)
Out[88]:
MiniBatchKMeans(batch_size=7908, n_clusters=7, n_init=10, random_state=42)

Save KMeans

In [64]:
pickle.dump(kmeans, open("kmeans.pkl", "wb"))

Load KMeans

In [89]:
kmeans = pickle.load(open("kmeans.pkl", "rb"))

How are individuals in AZDIAS distributed among clusters?

In [90]:
# Calculate percentage of each cluster in azdias
azdias_clusters_p = pd.Series(kmeans.labels_).value_counts()/len(kmeans.labels_)

# Plot the azdias clusters percentages
azdias_clusters_p.plot(kind="bar");
plt.title("Clusters of AZDIAS")
plt.xlabel("Cluster")
plt.ylabel("Count");

How are individuals in CUSTOMERS distribued among clusters?

In [91]:
# Clean the customers dataset using the pipeline we made earlier
clean_customers = clean_dataset(customers, keep_features=["KBA13_ANTG4"])
In [92]:
# Re-arrange columns to be just as clean_azdias to avoid any problems with imputer
clean_customers = clean_customers[clean_azdias.columns]
In [93]:
# Impute missing values using imputer fitted on clean_azdias
imputed_customers = imputer.transform(clean_customers)
In [94]:
# transform using scaler
scaled_customers = scaler.transform(imputed_customers)
In [95]:
# transform using pca
pca_customers = pca.transform(scaled_customers)[:, :d]
In [96]:
# Predict KMeans labels
customers_clusters = kmeans.predict(pca_customers)
In [97]:
# Calculate percentage of each cluster in customers
customers_clusters_p = pd.Series(customers_clusters).value_counts()/len(customers_clusters)

# Plot the customers clusters percentages
customers_clusters_p.plot(kind="bar");
plt.title("Clusters of CUSTOMERS")
plt.xlabel("Cluster")
plt.ylabel("Percentage");

How does cluster percentages differ between AZDIAS and CUSTOMERS?

This question is better answered using a bar chart with hue of whether the cluster belongs to AZDIAS or CUSTOMERS. And in order to do that we need to make a dataframe that has 1 column for AZDIAS and another for CUSTOMERS.

In [153]:
# Concatenate azdias and customers cluster percentages
clusters_p = pd.concat([azdias_clusters_p, customers_clusters_p], axis=1)*100

# Rename columns to AZDIAS and CUSTOMERS
clusters_p.columns = ["AZDIAS", "CUSTOMERS"]

clusters_p.sort_values(["CUSTOMERS", "AZDIAS"], ascending=False).plot(kind="bar", color=["grey", "tab:blue"])
plt.title("How Are Clusters Represented in Customer and General Population?")
plt.xlabel("Cluster")
plt.ylabel("Percentage %")
plt.tight_layout()
plt.savefig("clusters.png");
  1. We can see an over represenation of cluster 0 in CUSTOMERS when compared to AZDIAS, with over 40% of CUSTOMERS being in this cluster.
  2. Clusters 4 and 6 percentages in CUSTOMERS also exceed their counterparts in AZDIAS.
  3. The most rare cluster in CUSTOMERS is cluster 5, followed by 3, 2 and 1.
  4. Therefore, the clusters that are more generally inclined to be customers are 0, 4 and 6.
  5. While the clusters that are less inclined to be customers are 1, 2, 3, and 5.

Right now we should look into these clusters that are more likely to be customers in the general population to better understand them.

Analyzing Cluster 0

In [99]:
# Filtering individuals from cluster 0 in AZDIAS
cluster_0 = clean_azdias[kmeans.labels_ == 0]
In [100]:
# Checking the dimensions of the dataframe
cluster_0.shape
Out[100]:
(134644, 354)
In [101]:
# Taking a quick look at the statistics of the cluster
cluster_0.describe()
Out[101]:
AKT_DAT_KL ALTER_HH ALTER_KIND1 ALTER_KIND2 ALTER_KIND3 ALTER_KIND4 ALTERSKATEGORIE_FEIN ANZ_HAUSHALTE_AKTIV ANZ_HH_TITEL ANZ_KINDER ANZ_PERSONEN ANZ_STATISTISCHE_HAUSHALTE ANZ_TITEL ARBEIT BALLRAUM CAMEO_DEUG_2015 CAMEO_INTL_2015 CJT_GESAMTTYP CJT_KATALOGNUTZER CJT_TYP_1 CJT_TYP_2 CJT_TYP_3 CJT_TYP_4 CJT_TYP_5 CJT_TYP_6 D19_BANKEN_ANZ_12 D19_BANKEN_ANZ_24 D19_BANKEN_DATUM D19_BANKEN_OFFLINE_DATUM D19_BANKEN_ONLINE_DATUM D19_BANKEN_ONLINE_QUOTE_12 D19_GESAMT_ANZ_12 D19_GESAMT_ANZ_24 D19_GESAMT_DATUM D19_GESAMT_OFFLINE_DATUM D19_GESAMT_ONLINE_DATUM D19_GESAMT_ONLINE_QUOTE_12 D19_KONSUMTYP D19_KONSUMTYP_MAX D19_TELKO_ANZ_12 D19_TELKO_ANZ_24 D19_TELKO_DATUM D19_TELKO_OFFLINE_DATUM D19_TELKO_ONLINE_DATUM D19_VERSAND_ANZ_12 D19_VERSAND_ANZ_24 D19_VERSAND_DATUM D19_VERSAND_OFFLINE_DATUM D19_VERSAND_ONLINE_DATUM D19_VERSAND_ONLINE_QUOTE_12 D19_VERSI_ANZ_12 D19_VERSI_ANZ_24 D19_VERSI_DATUM D19_VERSI_OFFLINE_DATUM D19_VERSI_ONLINE_DATUM DSL_FLAG EINGEZOGENAM_HH_JAHR EWDICHTE FINANZ_ANLEGER FINANZ_HAUSBAUER ... SEMIO_RAT SEMIO_REL SEMIO_SOZ SEMIO_TRADV SEMIO_VERT SHOPPER_TYP SOHO_KZ STRUKTURTYP UMFELD_ALT UMFELD_JUNG UNGLEICHENN_FLAG VERDICHTUNGSRAUM VERS_TYP VHN VK_DHT4A VK_DISTANZ VK_ZG11 W_KEIT_KIND_HH WOHNDAUER_2008 WOHNLAGE ZABEOTYP ANREDE_KZ ALTERSKATEGORIE_GROB D19_LETZTER_KAUF_BRANCHE_D19_BANKEN_DIREKT D19_LETZTER_KAUF_BRANCHE_D19_BANKEN_GROSS D19_LETZTER_KAUF_BRANCHE_D19_BANKEN_LOKAL D19_LETZTER_KAUF_BRANCHE_D19_BANKEN_REST D19_LETZTER_KAUF_BRANCHE_D19_BEKLEIDUNG_GEH D19_LETZTER_KAUF_BRANCHE_D19_BEKLEIDUNG_REST D19_LETZTER_KAUF_BRANCHE_D19_BILDUNG D19_LETZTER_KAUF_BRANCHE_D19_BIO_OEKO D19_LETZTER_KAUF_BRANCHE_D19_BUCH_CD D19_LETZTER_KAUF_BRANCHE_D19_DIGIT_SERV D19_LETZTER_KAUF_BRANCHE_D19_DROGERIEARTIKEL D19_LETZTER_KAUF_BRANCHE_D19_ENERGIE D19_LETZTER_KAUF_BRANCHE_D19_FREIZEIT D19_LETZTER_KAUF_BRANCHE_D19_GARTEN D19_LETZTER_KAUF_BRANCHE_D19_HANDWERK D19_LETZTER_KAUF_BRANCHE_D19_HAUS_DEKO D19_LETZTER_KAUF_BRANCHE_D19_KINDERARTIKEL D19_LETZTER_KAUF_BRANCHE_D19_KOSMETIK D19_LETZTER_KAUF_BRANCHE_D19_LEBENSMITTEL D19_LETZTER_KAUF_BRANCHE_D19_LOTTO D19_LETZTER_KAUF_BRANCHE_D19_NAHRUNGSERGAENZUNG D19_LETZTER_KAUF_BRANCHE_D19_RATGEBER D19_LETZTER_KAUF_BRANCHE_D19_REISEN D19_LETZTER_KAUF_BRANCHE_D19_SAMMELARTIKEL D19_LETZTER_KAUF_BRANCHE_D19_SCHUHE D19_LETZTER_KAUF_BRANCHE_D19_SONSTIGE D19_LETZTER_KAUF_BRANCHE_D19_TECHNIK D19_LETZTER_KAUF_BRANCHE_D19_TELKO_MOBILE D19_LETZTER_KAUF_BRANCHE_D19_TELKO_REST D19_LETZTER_KAUF_BRANCHE_D19_TIERARTIKEL D19_LETZTER_KAUF_BRANCHE_D19_UNBEKANNT D19_LETZTER_KAUF_BRANCHE_D19_VERSAND_REST D19_LETZTER_KAUF_BRANCHE_D19_VERSICHERUNGEN D19_LETZTER_KAUF_BRANCHE_D19_VOLLSORTIMENT D19_LETZTER_KAUF_BRANCHE_D19_WEIN_FEINKOST MIN_GEBAEUDEJAHR_ENG EINGEFUEGT_AM_DAY
count 134644.000000 96438.000000 134644.000000 134644.000000 134644.000000 134644.000000 96241.000000 134644.000000 132454.000000 134644.000000 134644.000000 134632.000000 134644.000000 133947.000000 134507.000000 133893.000000 133893.000000 133667.000000 133667.000000 133667.000000 133667.000000 133667.000000 133667.000000 133667.000000 133667.000000 134644.000000 134644.000000 134644.000000 134644.000000 134644.000000 112449.000000 134644.000000 134644.000000 134644.000000 134644.000000 134644.000000 112449.000000 112449.00000 134644.000000 134644.000000 134644.000000 134644.000000 134644.000000 134644.000000 134644.000000 134644.000000 134644.000000 134644.000000 134644.000000 112449.000000 134644.000000 134644.000000 134644.000000 134644.000000 134644.000000 134644.000000 134644.000000 134507.000000 134644.000000 134644.000000 ... 134644.000000 134644.000000 134644.000000 134644.000000 134644.000000 131910.000000 134644.000000 133947.000000 133784.000000 133784.000000 134644.000000 45036.000000 131910.000000 124236.000000 134143.000000 134143.000000 134143.000000 124884.000000 134644.000000 134644.000000 134644.000000 134644.000000 134644.000000 134644.000000 134644.000000 134644.000000 134644.000000 134644.000000 134644.000000 134644.000000 134644.000000 134644.000000 134644.000000 134644.000000 134644.000000 134644.000000 134644.000000 134644.000000 134644.000000 134644.000000 134644.000000 134644.000000 134644.000000 134644.000000 134644.000000 134644.000000 134644.000000 134644.000000 134644.000000 134644.000000 134644.000000 134644.000000 134644.000000 134644.000000 134644.000000 134644.000000 134644.000000 134644.000000 134644.000000 134644.000000
mean 3.187309 12.615328 0.350302 0.107758 0.017075 0.003260 11.619549 2.235257 0.009852 0.041413 1.677423 2.155119 0.004865 2.569427 4.824849 3.781213 25.638205 2.659901 3.932362 2.115010 1.784913 4.335543 4.250114 4.388413 4.402613 0.021263 0.062157 9.686581 9.979925 9.792958 0.112264 0.366351 0.739513 7.543507 9.078756 8.543574 1.455602 6.11615 5.715390 0.016146 0.040774 9.676562 9.912116 9.987218 0.243516 0.505102 8.410200 9.309683 8.839280 1.154799 0.064704 0.131755 9.346744 9.966824 9.992580 0.987842 2000.674194 3.033292 1.839837 2.466601 ... 2.983883 2.934368 3.890296 2.799523 4.646356 1.588674 0.008764 2.287711 2.955129 4.376697 0.070363 9.902145 1.482390 2.337881 6.437123 7.362978 5.235555 4.753828 8.457339 4.473464 2.684590 1.469074 3.525356 0.014899 0.010569 0.001894 0.005021 0.017676 0.014416 0.001352 0.002547 0.017112 0.003788 0.020142 0.029285 0.007776 0.003647 0.004345 0.031602 0.004226 0.000616 0.009900 0.001315 0.007835 0.007561 0.004464 0.006246 0.013539 0.073044 0.010197 0.014594 0.015374 0.002703 0.335344 0.015597 0.083539 0.038643 0.004352 23.928232 8518.944929
std 3.393323 3.462417 2.194212 1.256388 0.506017 0.221193 2.599557 2.917603 0.103692 0.250355 1.030734 2.830709 0.074126 0.986467 1.764950 1.922677 11.327308 1.384080 1.306200 0.885509 0.834673 0.875818 0.965004 0.851146 0.867370 0.182249 0.342634 1.005040 0.254498 0.823377 1.053200 0.715816 1.078904 2.589646 1.597698 2.172348 3.468826 2.96851 3.164497 0.140156 0.227035 0.943395 0.485931 0.195821 0.598513 0.919207 2.176648 1.369595 1.990829 3.146736 0.305743 0.454610 1.602997 0.292414 0.166834 0.109591 6.653992 1.437792 1.030203 1.200813 ... 1.207026 1.416085 1.658931 1.256696 1.785728 1.132793 0.093205 0.839537 1.339260 1.022840 0.255759 11.101193 0.499692 1.015760 2.707103 3.117996 2.764276 1.537873 1.420005 2.255901 1.199341 0.499045 0.705871 0.121147 0.102259 0.043478 0.070679 0.131772 0.119198 0.036741 0.050408 0.129688 0.061428 0.140486 0.168604 0.087839 0.060278 0.065772 0.174938 0.064870 0.024821 0.099006 0.036233 0.088171 0.086623 0.066661 0.078785 0.115569 0.260210 0.100466 0.119921 0.123035 0.051924 0.472112 0.123909 0.276696 0.192742 0.065828 2.797748 1333.783703
min 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 1.000000 1.000000 1.000000 12.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 0.000000 0.000000 1.000000 1.000000 1.000000 0.000000 0.000000 0.000000 1.000000 1.000000 1.000000 0.000000 1.00000 1.000000 0.000000 0.000000 1.000000 1.000000 1.000000 0.000000 0.000000 1.000000 1.000000 1.000000 0.000000 0.000000 0.000000 1.000000 1.000000 1.000000 0.000000 1900.000000 1.000000 1.000000 1.000000 ... 1.000000 1.000000 1.000000 1.000000 1.000000 0.000000 0.000000 1.000000 1.000000 1.000000 0.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 0.000000 1.000000 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 4.000000 1424.000000
25% 1.000000 10.000000 0.000000 0.000000 0.000000 0.000000 10.000000 1.000000 0.000000 0.000000 1.000000 1.000000 0.000000 2.000000 3.000000 2.000000 14.000000 2.000000 3.000000 1.000000 1.000000 4.000000 4.000000 4.000000 4.000000 0.000000 0.000000 10.000000 10.000000 10.000000 0.000000 0.000000 0.000000 5.000000 9.000000 8.000000 0.000000 3.00000 2.000000 0.000000 0.000000 10.000000 10.000000 10.000000 0.000000 0.000000 8.000000 9.000000 9.000000 0.000000 0.000000 0.000000 10.000000 10.000000 10.000000 1.000000 1994.000000 2.000000 1.000000 2.000000 ... 2.000000 2.000000 2.000000 2.000000 3.000000 1.000000 0.000000 2.000000 2.000000 4.000000 0.000000 1.000000 1.000000 2.000000 5.000000 6.000000 3.000000 4.000000 9.000000 3.000000 1.000000 1.000000 3.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 25.000000 9090.000000
50% 1.000000 12.000000 0.000000 0.000000 0.000000 0.000000 12.000000 1.000000 0.000000 0.000000 1.000000 1.000000 0.000000 3.000000 6.000000 4.000000 24.000000 2.000000 4.000000 2.000000 2.000000 5.000000 5.000000 5.000000 5.000000 0.000000 0.000000 10.000000 10.000000 10.000000 0.000000 0.000000 0.000000 9.000000 10.000000 10.000000 0.000000 6.00000 8.000000 0.000000 0.000000 10.000000 10.000000 10.000000 0.000000 0.000000 9.000000 10.000000 10.000000 0.000000 0.000000 0.000000 10.000000 10.000000 10.000000 1.000000 1999.000000 3.000000 2.000000 2.000000 ... 3.000000 3.000000 4.000000 3.000000 5.000000 1.000000 0.000000 3.000000 3.000000 5.000000 0.000000 5.000000 1.000000 2.000000 7.000000 8.000000 5.000000 6.000000 9.000000 4.000000 3.000000 1.000000 4.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 25.000000 9092.000000
75% 6.000000 15.000000 0.000000 0.000000 0.000000 0.000000 14.000000 2.000000 0.000000 0.000000 2.000000 2.000000 0.000000 3.000000 6.000000 5.000000 32.000000 4.000000 5.000000 3.000000 2.000000 5.000000 5.000000 5.000000 5.000000 0.000000 0.000000 10.000000 10.000000 10.000000 0.000000 1.000000 1.000000 10.000000 10.000000 10.000000 0.000000 9.00000 8.000000 0.000000 0.000000 10.000000 10.000000 10.000000 0.000000 1.000000 10.000000 10.000000 10.000000 0.000000 0.000000 0.000000 10.000000 10.000000 10.000000 1.000000 2005.000000 4.000000 2.000000 3.000000 ... 4.000000 4.000000 6.000000 4.000000 6.000000 3.000000 0.000000 3.000000 4.000000 5.000000 0.000000 15.000000 2.000000 3.000000 9.000000 10.000000 7.000000 6.000000 9.000000 7.000000 3.000000 2.000000 4.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 25.000000 9092.000000
max 9.000000 21.000000 18.000000 18.000000 18.000000 18.000000 25.000000 122.000000 4.000000 6.000000 13.000000 104.000000 2.000000 9.000000 7.000000 9.000000 55.000000 6.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 6.000000 10.000000 10.000000 10.000000 10.000000 6.000000 6.000000 10.000000 10.000000 10.000000 10.000000 9.00000 9.000000 4.000000 4.000000 10.000000 10.000000 10.000000 6.000000 6.000000 10.000000 10.000000 10.000000 10.000000 5.000000 6.000000 10.000000 10.000000 10.000000 1.000000 2018.000000 6.000000 5.000000 5.000000 ... 7.000000 7.000000 7.000000 7.000000 7.000000 3.000000 1.000000 3.000000 5.000000 5.000000 1.000000 45.000000 2.000000 4.000000 11.000000 13.000000 11.000000 6.000000 9.000000 8.000000 6.000000 2.000000 9.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 32.000000 9092.000000

8 rows × 354 columns

As we can see it's hard to get any insight from the statistics, so I think it would be better if were able to visualize all features, so I'll make a function that visualizes all features of a given cluster.

In [102]:
# Make an index for each category in order to sort them during visualization
cats =  ["Person",
         "Household",
         "Building",
         "Community",
         "Postcode",
         "PLZ8",
         "RR1_ID",
         "Microcell (RR3_ID)",         
         "Microcell (RR4_ID)",
         "Unknown"]

cats_order = {cat: i for i, cat in enumerate(cats)}

cats_order
Out[102]:
{'Person': 0,
 'Household': 1,
 'Building': 2,
 'Community': 3,
 'Postcode': 4,
 'PLZ8': 5,
 'RR1_ID': 6,
 'Microcell (RR3_ID)': 7,
 'Microcell (RR4_ID)': 8,
 'Unknown': 9}
In [103]:
feat_cat_tuples = []
known_feats = set(dias_atts.Attribute)

# Match each feature with category and assign unknown if no category exists
for feat in clean_azdias.columns:
    cat = "Unknown"
    if feat in known_feats:
        cat = dias_atts[dias_atts["Attribute"] == feat]["Information level"].item().strip()
    feat_cat_tuples.append((feat, cat))
    
# Make dataframe from resulting tuples of feature and category
feat_cat_df = pd.DataFrame(feat_cat_tuples, columns=["feature", "category"])

# Assign category order for each category
feat_cat_df["cat_order"] = feat_cat_df["category"].map(cats_order)

# Sort by category then alphabeticaly over features
feat_cat_df.sort_values(by=["cat_order", "feature"], inplace=True)

feat_cat_df.head()
Out[103]:
feature category cat_order
316 ALTERSKATEGORIE_GROB Person 0
315 ANREDE_KZ Person 0
17 CJT_GESAMTTYP Person 0
64 FINANZTYP Person 0
58 FINANZ_ANLEGER Person 0
In [104]:
compare_features(dfs=[cluster_0, clean_azdias], labels=["Cluster 0", "AZDIAS"])
/opt/conda/lib/python3.6/site-packages/matplotlib/pyplot.py:523: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
  max_open_warning, RuntimeWarning)

Insights about cluster 0:

  1. Age through prename analysis (ALTERSKATEGORIE_GROB) is left skewed, with the majority of individuals in this cluster between 46-60 and over 60 years old.
  2. There is no huge difference in gender between clutser 0 and the general population, however they slightly have more male percentage.
  3. Around 60% of cluster 0 are adverstising and consumption minimalists/traditionalists, while the population shows more of a uniform distribution around the spectrum of the Customer-Journey Typology relating to the preferred information and buying channels of consumers (CJT_GESMATTYP). Therefore, we can that individuals are more likely to be customers given that their buying channels aren't online or local stores.
  4. The majority of financial types of the customers in cluster 0 are money saver and unremarkable with around 80% (FINANZTYP).
  5. Cluster 0 customers tend to be the investing type, where the distribution of the feature (FINANZ_ANLEGER) is skewed towards more investing behavior, while the general population has more of general distribution of the spectrum.
  6. LP_LEBENSPHASE_FEIN (lifestage fine) and LP_LEBENSPHASE_GROB (lifestage rough) difference in distribution between cluster 0 and general population further indicates that the individuals in this cluster are high earning elder individuals, where they have heavy presence over values indicating advanced age and retired individuals, which are either single or married.
  7. LP_STATUS_FEIN (social status fine) and LP_STATUS_GROB (social status rough) indicate that the majority of these individuals are top earners and homeowners, where the percentage of these type of individuals in the cluster is almost 70%, which is more than double of the percentage of these types in the general population (around 30%).
  8. The main age of the household (ALTER_HH) tends to be old, which leans towards a normal distribution centered around the age of 50-60 years old, while the general population if left skewed.
  9. Household features paint the same picture around these indivdiauls, which is that they are typically high-earning elder individuals.
  10. Around 60% of this cluster live far over 50km from the nearest metropole.
  11. The rest of the features don't show any more striking differences, and they don't add new information other than that gained from Person features.

Now that we got a handle of which features we should look into, let's compare the 3 cluster to the general population all at once.

In [105]:
# Filtering individuals from cluster 0 in AZDIAS
cluster_4 = clean_azdias[kmeans.labels_ == 4]
cluster_6 = clean_azdias[kmeans.labels_ == 6]
In [106]:
compare_features([cluster_0, cluster_4, cluster_6, clean_azdias], ["Cluster 0", "Cluster 4", "Cluster 6", "Population"])
/opt/conda/lib/python3.6/site-packages/matplotlib/pyplot.py:523: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
  max_open_warning, RuntimeWarning)

Unfortunately this doesn't provide easy information to digest, so I'll resort to analyzing the cluster against the general population dependently again.

In [107]:
compare_features([cluster_4, clean_azdias], ["Cluster 4", "Population"])
/opt/conda/lib/python3.6/site-packages/matplotlib/pyplot.py:523: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
  max_open_warning, RuntimeWarning)

Insights about cluster 4:

  1. The majority of cluster 4 are individuals predicted to be between 46-60, while the probability of being over 60, between 30-45 or less than 30 is almost the same.
  2. Unlike cluster 0, individuals in cluster 4 are more advertisment friendly (CJT_GESMATTYP) having the majorty of them in advertising interested and enthusiast categories with either online shops or stores.
  3. The majority of their financial type is unremarkable, followed by low financial interest and main focus is the own house (FINANZTYP).
  4. The vacation habits of this cluster indicate that their individuals are more connected to their families (GFK_URLAUBERTYP).
  5. My guess was correct about this cluster representing families, as the family type feature (LP_FAMILIE_FEIN & LP_FAMILIE_GROB) indicates that this cluster has higher than normal representation of families and multi-generation households.
  6. The same is indicated by life stage featuers (LP_LEBENSPHASE_FEIN & LP_LEBENSPHASE_GROB), where they households of multiple persons whose income ranges between medium and high income.
  7. More than 50% of this cluster are high income and home owning individuals (LP_STATUS_FEIN & LP_STATUS_GROB).
  8. The main age in the household is more skewed towards higher ages than the general population (**ALTER_HH).
  9. Aroun 45% of this cluster live far over 50km from the nearest metropole.
  10. The rest of the feature paint the same picture as the person features.
In [ ]:
compare_features([cluster_6, clean_azdias], ["Cluster 6", "Population"])
/opt/conda/lib/python3.6/site-packages/matplotlib/pyplot.py:523: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
  max_open_warning, RuntimeWarning)

Insights about cluster 6:

  1. The majority of ages are above 46 years old.
  2. We can that for the first time this cluster has more females than males.
  3. The preferred information and buyings channels is more diverse than the other two clusters.
  4. Around 40% are investors.
  5. Almost 70% are single.
  6. Around 50% of the cluster live less than 20km from the nearest metropole.
  7. We can see that there is alot of resemblance between this cluster and the two previous cluster, so it we need not scrutnize it as it doesn't provide new information about the customers.

Since we figured out the major characterstics of the customer in each cluster, we can see that these clusters share most of them.

Now we can compare cluster 0 and 5 to get a sense of the big differences between customers and non-customers.

In [ ]:
cluster_5 = clean_azdias[kmeans.labels_ == 5]
compare_features([cluster_0, cluster_5], ["Cluster 0", "Cluster 5"])
In [146]:
feats_to_drop = ["D19_BANKEN_DATUM", "D19_BANKEN_OFFLINE_DATUM", "D19_BANKEN_ONLINE_DATUM", "D19_TELKO_DATUM",
                 "D19_TELKO_OFFLINE_DATUM", "D19_TELKO_ONLINE_DATUM", "D19_VERSI_DATUM", "D19_VERSI_OFFLINE_DATUM",
                 "D19_VERSI_ONLINE_DATUM", "ANZ_HH_TITEL", "KBA05_MODTEMP", "BALLRAUM", "INNENSTADT", "EWDICHTE",
                 ]

From the previous plots we can select several of them to finally be able to represent some differences between customers and non-customers, and these features are:

  1. ALTERSKATEGORIE_GROB (Age through prename analysis)
  2. ANREDE_KZ (Gender)
  3. CJT_GESMATTYP (Preferred information and buying channels)
  4. FINANZTYP (Financial type)
  5. LP_LEBENSPHASE_FEIN (Lifestage)
  6. RETOURTYP_BK_S (Return type)
  7. ALTER_HH (Main age within household)
  8. HH_EINKOMMEN_SCORE (Estimated household net income)
  9. WOHNLAGE (Neighbourhood area)
  10. MOBI_REGIO (Moving patterns)

There are more features that also emphasize differences between customers and non-customers, however I found that they offer redundant information.

There are also some features that shows no difference between the two groups, specifcally features that are related to motor vehciles information in PLZ8 areas or microcells. Which indicated that we might want to use some sort of variance thresholding passing the data to a machine learning algorithm.

Right now it's time to conclude the differences between customers and non-customers through visualizing the features that I listed.

The way the I'll do this is that I'll partition clusters with high tendency to become customers in on dataframe, and others with low tendency in another dataframe, then I'll visualize the stated features between them.

I'll also make separate plot that dissects each group's clusters in order to have an idea about the differences between them.

First let's take a better look at cluster percentages in customers and general population

In [147]:
clusters_p.sort_values(["CUSTOMERS", "AZDIAS"], ascending=False).plot(kind="bar", color=["grey", "tab:blue"]);
plt.tight_layout()
plt.title("How Are Clusters Represented in Customer and General Population?");

We can see the clusters 0, 4 and 6 have the greater tendency of being customers, while clusters 2, 3 and 5 have the greater tendency of being non-customers, leaving cluster 1 with slightly greater tendency of being non-customer.

My intuition is that we should visualize the differences between cluster 0-4-6 and clusters 2-3-5 leaving out cluster 1, as individuals in this cluster have really similar tendencies of being customers or non-customers. So that would decrease the sharpness of differences between the two groups.

In [148]:
customers = clean_azdias[np.in1d(kmeans.labels_, [0, 4, 6])]
cluster_0 = clean_azdias[kmeans.labels_ == 0]
cluster_4 = clean_azdias[kmeans.labels_ == 4]
cluster_6 = clean_azdias[kmeans.labels_ == 6]

non_customers = clean_azdias[np.in1d(kmeans.labels_, [2, 3, 5])]
cluster_2 = clean_azdias[kmeans.labels_ == 2]
cluster_3 = clean_azdias[kmeans.labels_ == 3]
cluster_5 = clean_azdias[kmeans.labels_ == 5]

While plotting, we need to map values of features to their meanings in order to make it easier to understand the plots. So let's check the DIAS VALUES sheets.

In [149]:
dias_vals.head(10)
Out[149]:
Unnamed: 0 Attribute Description Value Meaning
0 NaN AGER_TYP best-ager typology -1 unknown
1 NaN AGER_TYP NaN 0 no classification possible
2 NaN AGER_TYP NaN 1 passive elderly
3 NaN AGER_TYP NaN 2 cultural elderly
4 NaN AGER_TYP NaN 3 experience-driven elderly
5 NaN ALTERSKATEGORIE_GROB age classification through prename analysis 0 unknown
6 NaN ALTERSKATEGORIE_GROB NaN 1 < 30 years
7 NaN ALTERSKATEGORIE_GROB NaN 2 30 - 45 years
8 NaN ALTERSKATEGORIE_GROB NaN 3 46 - 60 years
9 NaN ALTERSKATEGORIE_GROB NaN 4 > 60 years

We need to forward fill Attribute in order to map Value to Meaning.

In [150]:
# ffill Attribute 
dias_vals["Attribute"].fillna(method="ffill", inplace=True)
dias_vals.head()
Out[150]:
Unnamed: 0 Attribute Description Value Meaning
0 NaN AGER_TYP best-ager typology -1 unknown
1 NaN AGER_TYP NaN 0 no classification possible
2 NaN AGER_TYP NaN 1 passive elderly
3 NaN AGER_TYP NaN 2 cultural elderly
4 NaN AGER_TYP NaN 3 experience-driven elderly

We also need to change multiple values in Value into one.

In [151]:
# change multiple values in Value into one value
dias_vals["Value"] = dias_vals["Value"].replace({"-1, 0": 0, "-1, 9": 9})
dias_vals.head(10)
Out[151]:
Unnamed: 0 Attribute Description Value Meaning
0 NaN AGER_TYP best-ager typology -1 unknown
1 NaN AGER_TYP NaN 0 no classification possible
2 NaN AGER_TYP NaN 1 passive elderly
3 NaN AGER_TYP NaN 2 cultural elderly
4 NaN AGER_TYP NaN 3 experience-driven elderly
5 NaN ALTERSKATEGORIE_GROB age classification through prename analysis 0 unknown
6 NaN ALTERSKATEGORIE_GROB NaN 1 < 30 years
7 NaN ALTERSKATEGORIE_GROB NaN 2 30 - 45 years
8 NaN ALTERSKATEGORIE_GROB NaN 3 46 - 60 years
9 NaN ALTERSKATEGORIE_GROB NaN 4 > 60 years

Now we need to figure out how to make a mapping for each feature between it's value and meaning.

In [152]:
feat = "ALTERSKATEGORIE_GROB"

# filter out featue
feat_vals = dias_vals[dias_vals["Attribute"] == feat]

# make a series of Value and Meaning with Value as index
feat_vals_meaning = feat_vals[["Value", "Meaning"]].set_index("Value")

# convert series to dict 
feat_vals_meaning = feat_vals_meaning.to_dict()["Meaning"]

feat_vals_meaning
Out[152]:
{0: 'unknown',
 1: '< 30 years',
 2: '30 - 45 years',
 3: '46 - 60 years',
 4: '> 60 years',
 9: 'uniformly distributed'}
In [161]:
compare_feature("ALTERSKATEGORIE_GROB", title="Age", figsize=(12, 5))
plt.tight_layout()
plt.savefig("Age.png")

Age

Customers and Non-Customers

We can see that customers have greater probability of being older, with almost 80% being above 45 years old. On the other hand non-customers tend have more than 50% of less than 46 years old. The age groups that is mostly shared between the two groups is 46-460 years group.

Customer Clusters

We can see that cluster 4 stand out with higher percentage of indiviudals less than 46 years old, while cluster 0 has more than 90% of it's population above 46 years old.

So cluster 0 includes is mostly elders with the majority being above 60 years old, while cluster 4 has the majority above 45 but also has higher than average percentage of younger indivduals, and cluster 6 is similar to cluster 0 except that the percentage of 46-60 years indivduals is larger than cluster 0.

Gender

In [163]:
compare_feature("ANREDE_KZ", "Gender", figsize=(12, 4))
plt.tight_layout()
plt.savefig("Gender.png")

Customers and Non-Customers

The percentage of males in customers is higher than that in non-customers, while the percentage of females in both is higher than males.

Customer Clusters

Cluster 0 has an over representation of males, where males is higher than all clusters and higher than female percentage in the same cluster. While cluster 4 and 6 have higher female percentages than cluster 0.

Preferred Information and Buying Channels

In [171]:
compare_feature("CJT_GESAMTTYP", "Channels", figsize=(12, 5))
# plt.tight_layout()
plt.savefig("Channels.png", bbox="tight")

Customer and Non-Customers

We can see than customers exceed non-customers in percentages of advertising and consumption minamilists and traditionalists, while non-customers tend to be more open in that spectrum.

Customer Clusters

Since cluster 0 mostly represents elderly individuals, it's expected that they will be over represented in the minimalists and traditionlists. And also since cluster 4 represents the younger customers, we don't see alot of them as minimalist and traditionalists. And finally we can see that cluster 6 has the most uniform distribtution across the spectrum.

Financial Type

In [172]:
compare_feature("FINANZTYP", "Financial Type", figsize=(12, 5))
plt.tight_layout()
plt.savefig("Financial.png")

Customers and Non-Customers

20% of customers are money savers, while another 20% are inverstors, and around 35% are unremarkable which I guess means that they have no specific financial type. On the other hand, non-customers tend to have low financial interest.

Customer Clusters

We can that the majority of cluster 0 with distinguished financial type are money savers, while in cluster 6 they are investors. Cluster 4 doesn't show a specific type.

Life Stage

In [178]:
compare_feature("LP_LEBENSPHASE_GROB", "Life Stage", figsize=(12, 6))
# plt.tight_layout()
plt.savefig("Life_Stage.png", bbox="tight")

Customers and Non-Customers

The most frequent non-customer type is single low-income and average earners of younger age, while customers' most frequent type is singe high-income earner. However, there is no great difference between the most frequent value of customers and two next most frequent values, indicating the difference between clusters.

Customer Clusters

Around 70% of cluster 6 are single, with the majority of them being single low-income average earners of higher age., while the most frequent type in cluster 0 is single high-income earners, while cluster 4's most frequent type is high income earner of higher age from multiperson households. However, the remaining majority of cluster 4 types falls in younger aged families with different types of income.

Return Type

In [179]:
compare_feature("RETOURTYP_BK_S", "Return Type", figsize=(12, 6))
plt.tight_layout()
plt.savefig("Return.png");

Customers and Non-Customers

The most frequent type in customers is determined minimal returner, which I think means that these individuals aren't the shopping type. They only buy what they need when they need it. The second frequent type in incentive receptive normal returner. While in non-customers, we can see that the most frequent type is influencable crazy shopper, and these wouldn't definetly be interested in mail-order cataloges.

Customers Clusters

First off we can see the cluster 0 and 6 are the only populating most of the customers belonging to the determined minimal returner category, and that makes sense since they are older individuals, and we have found that they are consumption minimalists and traditionalists. On the other hand, cluster 4 populates every other category with frequency higher than the determined minmal returner one, with them most frequent being demanding heavy returner.

Main Age within Household

In [181]:
compare_feature("ALTER_HH", "Main Age within Household", figsize=(12, 5))
plt.tight_layout()
plt.savefig("Main_Age.png")

Customers and Non-Customers

We have already investigated the age difference between customers and non-customers, and we can see that the main age within the household is also different between the two groups, where customers households tend be also older in age, while non-customers households tend to be younger.

Customer Clusters

We can see that cluster 4 is the main cluster populating younger ages in customers clusters, while cluster 0 and 6 have nearly identical distributions representing the elderly segments of the customers.

Estimated Net Household Income

In [183]:
compare_feature("HH_EINKOMMEN_SCORE", "Net Household Income", figsize=(12, 4))
plt.tight_layout()
plt.savefig("Net_Household.png");

Customers and Non-Customers

We can see a huge difference between the distribution of customers and non-customers among estimated net household income, where more than 50% of non-customers come from very low income households, and only around 15% of customers do. The most frequent in customers is average income, and the second most is lower income. However, the total percentage of customers whose income is average or above exceeds 50%.

Customers Clusters

Now we can see a difference between the two older segments, which are cluster 0 and 6. We can see that over 60% of cluster 6 households have either lower or very low income, while more than 70% of cluster 0 has average or higher income. Similarily cluster 4 also has around 70% of it's households having average or higher income.

Does this mean that cluster 6 is poorer than cluster 0?

Will that would be the case if this feature indicated the income of the individual, however since this feature indicates the net household income, this doesn't say anything about the specific individuals in cluster 0 and 6. Since cluster 6 had more tendency to be single, it makes sense that cluster 0 household income would be higher, because if cluster 0 is financially above average, it's safe to say that probably the rest of their family is the same, and that would make their net income larger than the same individual if he was single, and that's the situation for cluster 6.

Neighborhood Area

In [185]:
compare_feature("WOHNLAGE", "Neighborhood Area", figsize=(12, 6))
plt.tight_layout()
plt.savefig("Neighborhood.png");

Customers and Non-Customers

The most frequent neighborhood area in both customers and non-customers is average neighborhoods, however the next most frequent for customers is rural neighborhoods, while that of non-customers is poor neighboorhoods. We can also see that the percentage of customers occupying above average neighborhood areas is larger than non-customers'.

Customers Clusters

We can see that our remark about the household income difference between cluster 0 and 6 has been useful, because cluster 6 to have the highest percentage occupying average and above neighborhood areas, while the most frequent neighborhood area for cluster 0 is rural areas, since they are mostly families. Cluster 4 is extremely similar to cluster 0 in this attribute.

Moving Patterns

In [187]:
compare_feature("MOBI_REGIO", "Moving Patterns", figsize=(12, 4))
plt.tight_layout()
plt.savefig("Moving_Patterns.png");

Customers and Non-Customers

50% of customers are classified as having low or very mobility, while more than 60% of non-customers are the extreme opposite, with classification of either high or very high mobility.

Customers Clusters

Once again we can see some of the differing factors between cluster 0 and 6, since cluster 6 are mostly single individuals, more than 60% of their moving pattern is high or very high and 25% have middle mobility. On the other hand, since cluster 0 and 4 tend to in families, their mobility is much lower than cluster 6, with almost 75% of cluster 0 having low or very low mobility, and 65% of cluster 4 having the same.

Now that we have a clear idea about who our customers are, let's move into the next part.

Part 2: Supervised Learning Model

Now that you've found which parts of the population are more likely to be customers of the mail-order company, it's time to build a prediction model. Each of the rows in the "MAILOUT" data files represents an individual that was targeted for a mailout campaign. Ideally, we should be able to use the demographic information from each individual to decide whether or not it will be worth it to include that person in the campaign.

The "MAILOUT" data has been split into two approximately equal parts, each with almost 43 000 data rows. In this part, you can verify your model with the "TRAIN" partition, which includes a column, "RESPONSE", that states whether or not a person became a customer of the company following the campaign. In the next part, you'll need to create predictions on the "TEST" partition, where the "RESPONSE" column has been withheld.

The first thing that we should do now is a baseline model to improve upon.

We have already used the general population data for scaling, dimensionality reduction and clustering. We can use this pipeline for engineering new features, but right now the best thing is to make the most basic model that we can ever make which can deliver the fastest results and then we can think about how we can imporve it.

So we'll follow the main steps that we have followed before to see if this dataset will have similar properties to the general population dataset in order to see if the cleaning pipeline will need any changes.

In [188]:
mailout_train = pd.read_csv('../../data/Term2/capstone/arvato_data/Udacity_MAILOUT_052018_TRAIN.csv', sep=';')
/opt/conda/lib/python3.6/site-packages/IPython/core/interactiveshell.py:2785: DtypeWarning: Columns (18,19) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

Load necessary prerequistes if the notebook was restarted

In [189]:
# load imputer
imputer = pickle.load(open("imputer.pkl", "rb"))

# load scaler and pca
scaler = pickle.load(open("scaler.pkl", "rb"))
pca = pickle.load(open("pca.pkl", "rb"))
d = 224

# load kmeans model
kmeans = pickle.load(open("kmeans.pkl", "rb"))

# load clean_azdias
clean_azdias = pickle.load(open("clean_azdias.pkl", "rb"))

Missing Values

In [190]:
# Replace unknown values with null
mailout_train_new = replace_unknown_with_null(mailout_train)
In [191]:
# Check null percentages in columns and rows
fig = plt.figure(figsize=(14, 4))

plt.subplot(1, 2, 1)
plt.title("Null Percentages in Columns")
mailout_null_columns = get_null_prop(mailout_train_new, axis=0, plot=True)

plt.subplot(1, 2, 2)
plt.title("Null Percentages in Rows")
mailout_null_rows = get_null_prop(mailout_train_new, axis=1, plot=True)

plt.tight_layout()
plt.savefig("mailout_null_before.png")
In [192]:
# Test cleaning the dataset using cleaning function
mailout_train_clean = clean_dataset(mailout_train)

print("Shape before cleaning:", mailout_train.shape)
print("Shape after cleaning:", mailout_train_clean.shape)

# Check if feature set is the same as clean AZDIAS
set(mailout_train_clean.columns) == set(clean_azdias.columns)
Shape before cleaning: (42962, 367)
Shape after cleaning: (35093, 361)
Out[192]:
False

There seems to be some change in features in cleaned MAILOUT.

In [193]:
# Check for difference in features in both datasets
print("Features in clean MAILOUT not in clean AZDIAS:", set(mailout_train_clean.columns).difference(clean_azdias.columns))
print("Features in clean AZDIAS not in clean MAILOUT:", set(clean_azdias.columns).difference(mailout_train_clean.columns))
Features in clean MAILOUT not in clean AZDIAS: {'RESPONSE', 'D19_BUCH_CD', 'AGER_TYP', 'D19_VOLLSORTIMENT', 'VHA', 'D19_SOZIALES', 'EXTSEL992', 'D19_SONSTIGE'}
Features in clean AZDIAS not in clean MAILOUT: {'KBA13_ANTG4'}

So there are some features that we have in MAILOUT that weren't dropped, so if we opted to use these there are some points that we need to take care of:

  1. They aren't included in imputation pipeline
  2. They aren't inlcuded in scaling pipeline
  3. They aren't included in dimensionality reduction transformer
  4. They aren't included in clustering algorithm

So they could be concatenated with the results of the original pipeline and cleaning separetly if wanted to use them, or we could just drop them.

Right now since I'm opting to make a simple baseline, I'll keep them since I won't be using the pipeline that we have created just yet.

In [194]:
mailout_train_clean = clean_dataset(mailout_train_new, keep_features=["KBA13_ANTG4"])
In [195]:
# Check null percentages in columns and rows
fig = plt.figure(figsize=(14, 4))

plt.subplot(1, 2, 1)
plt.title("Null Percentages in Columns")
mailout_null_columns = get_null_prop(mailout_train_clean, axis=0, plot=True)

plt.subplot(1, 2, 2)
plt.title("Null Percentages in Rows")
mailout_null_rows = get_null_prop(mailout_train_clean, axis=1, plot=True)

plt.tight_layout()
plt.savefig("mailout_null_after.png")

I'll fill these columns with an ad-hoc most frequent imputation just to make the baseline model. But I first need to take a look at these features that weren't included in our previous analysis.

In [196]:
new_feats = ['D19_BUCH_CD', 'VHA', 'EXTSEL992', 'D19_SOZIALES',
             'D19_VOLLSORTIMENT', 'RESPONSE', 'AGER_TYP', 'D19_SONSTIGE']
In [197]:
mailout_train_clean[new_feats].head()
Out[197]:
D19_BUCH_CD VHA EXTSEL992 D19_SOZIALES D19_VOLLSORTIMENT RESPONSE AGER_TYP D19_SONSTIGE
0 NaN 1.0 47.0 1.0 6.0 0 2.0 NaN
1 NaN 1.0 56.0 5.0 6.0 0 1.0 6.0
2 NaN 4.0 36.0 2.0 6.0 0 1.0 6.0
3 6.0 1.0 41.0 1.0 6.0 0 2.0 6.0
4 NaN NaN 55.0 1.0 7.0 0 2.0 7.0
In [198]:
mailout_train_clean[new_feats].hist(figsize=(14, 14));
plt.tight_layout()
plt.savefig("mailout_new_feats.png");
In [199]:
mailout_train_clean[new_feats].isna().sum()/mailout_train_clean.shape[0]
Out[199]:
D19_BUCH_CD          0.427863
VHA                  0.447326
EXTSEL992            0.247884
D19_SOZIALES         0.282592
D19_VOLLSORTIMENT    0.378623
RESPONSE             0.000000
AGER_TYP             0.305047
D19_SONSTIGE         0.228963
dtype: float64

I have noticed that RESPONSE is extremely imbalanced, which means that results of the baseline will probably be really bad, which is good because it will give us room for improvement.

Right now I'll impute old features using previously made imputer and then impute new features based on the method we have used before, which is mean value for every feature as all of them have more than 10% missing values.

In [200]:
# Impute clean AZDIAS features using old imputer
mailout_train_clean.loc[:, clean_azdias.columns] = imputer.transform(mailout_train_clean[clean_azdias.columns])
In [201]:
# Make new imputer for remaining features
mailout_imputer = SimpleImputer(strategy="mean")

# Fit imputer and impute the remaining columns in place
mailout_train_clean.loc[:, :] = mailout_imputer.fit_transform(mailout_train_clean)

# Assert if all null values have been removed
assert (mailout_train_clean.isna().sum() > 0).any() == False

Finally we should check for any null values in RESPONSE.

In [202]:
mailout_train_clean["RESPONSE"].isna().sum()
Out[202]:
0
In [203]:
# Split training data into X and y
X = mailout_train_clean.loc[:, mailout_train_clean.columns != "RESPONSE"]
y = mailout_train_clean.loc[:, "RESPONSE"]

Now that we have a clean dataset it's time to train a model

The model that I have in mind is RandomForestClassifier.

In order to evaluate this baseline, we need to have some sort of validation set in order to score our results.

For validation I'll use Stratified KFold cross validation (in order to account for RESPONSE imbalance), and I'll use sklearn's classfication report which shows recall, precision and f1-score for each label.

I'll also use ROC AUC score since it's the score of the final competition.

The advantage of classification report is that it shows us the whole picutre, so we don't get decieved if the model is performing poorly on the under-represented class, and performing better on the over-represented one.

In [25]:
model_validation(RandomForestClassifier(random_state=seed), X, y)
Final Report:
              precision    recall  f1-score   support

         0.0       0.99      1.00      0.99     34658
         1.0       0.00      0.00      0.00       436

    accuracy                           0.99     35094
   macro avg       0.49      0.50      0.50     35094
weighted avg       0.98      0.99      0.98     35094

Metric Score: 0.5
/opt/conda/lib/python3.6/site-packages/sklearn/metrics/_classification.py:1248: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
/opt/conda/lib/python3.6/site-packages/sklearn/metrics/_classification.py:1248: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
/opt/conda/lib/python3.6/site-packages/sklearn/metrics/_classification.py:1248: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))

So as you can see, the results are really bad, and the silver lining of these results as that there is plenty of room for imporvement.

First we know that the target is extremely imbalanced, so we need to address this by using over-sampling or under-sampling techniques with the machine learning algorithm that we'll use.

Fortunately, this is provided by imblearn package.

In [33]:
model_validation(BalancedRandomForestClassifier(n_estimators=1000, random_state=seed), X, y)
Final Report:
              precision    recall  f1-score   support

         0.0       1.00      0.69      0.81     34658
         1.0       0.03      0.73      0.06       436

    accuracy                           0.69     35094
   macro avg       0.51      0.71      0.44     35094
weighted avg       0.98      0.69      0.81     35094

Metric Score: 0.7085381814866175

Baseline Results

We can see that this model still has bad performance, but it's siginifcantly better than the regular RandomForestClassifier.

The best thing about classification_report is that it gives us a variety of metrics that we can use to judge our model.

But how should we judge our model?

That depends on margin of error that we are willing to except with our model. So for example, we might want a model that is able to predict all indivduals likely to be customers, despite predicting a large sum of individuals that won't be customers. In this case we'd want a model with higher Recall.

On the other hand, sending out to a huge mass might be expensive and counter-intuitive business-wise in some case. In this case we'd want a model with higher Precision.

In order to get the best of both worlds, we can use the F1-Score, which calculate a harmonic average of Recall and Precision, penalizing a model that has bad scores for either one of the metrics.

Personally I prefer using F1-Score, specifically the macro-averaged F1-Score, which calculate the average of F1-Score for each class regardless the number of data points belonging to that class, as a weighted average F1-Score wouldn't account the main class we are concerned with predicting due to it's extremely low number.

It also looks like ROC AUC Macro is identical to Recall Macro, so we can use Recall as a proxy for ROC AUC instead of calculating it twice.

I think that getting a model with high precision with this dataset would be a stretch, so my best guess right now is to use Recall, but the other metric to avoid a model with really bad precision.

Therefore, we can see the results are:

  1. Recall (Macro): 0.71
  2. Precision (Macro): 0.51
  3. F1-Score (Macro): 0.44

The first imporvement that we can make is to utilize the dimensionality reduction that we have made earlier.

  1. Select only features in clean_azdias
  2. Scale the features and reduce dimensions using PCA
  3. Predict using PCA

Predicting using PCA reduced dataset

In [204]:
# Filter for features used in clean_azdias
mailout_train_pre_pca = mailout_train_clean[clean_azdias.columns]

# Scale the features
mailout_train_pre_pca = scaler.transform(mailout_train_pre_pca)

# Reduce dimensions using pca to d dimensions
X = pd.DataFrame(pca.transform(mailout_train_pre_pca)[:, :d])
In [29]:
model_validation(BalancedRandomForestClassifier(n_estimators=1000, random_state=seed), X, y)
Final Report:
              precision    recall  f1-score   support

         0.0       0.99      0.59      0.74     34658
         1.0       0.02      0.58      0.03       436

    accuracy                           0.59     35094
   macro avg       0.50      0.58      0.39     35094
weighted avg       0.98      0.59      0.73     35094

Metric Score: 0.5825897065477098

The results of PCA transformed training set with BalancedRandomForestClassifier is worse than using the vanilla dataset.

Could these results be improved with other algorithms?

We could spot-check multiple algorithms to prove if this is the case.

Instead of implementing under-sampling or over-sampling in validation, we can use BalancedBaggingClassifier to wrap the classfication algortithms that we are interested in using. But we'll need to scale the data because algorithms like Logistic Regression, KNN and SVM can perform better after scaling the data.

Note that when comparing different models result, I'll not increase n_estimators to save time, as the results are just for comparison.

In [28]:
bagging_model = lambda model: BalancedBaggingClassifier(model)

models = {"RF": BalancedRandomForestClassifier(random_state=seed),
          "LR": bagging_model(LogisticRegression(max_iter=1000, random_state=seed)),
          "KNN": bagging_model(KNeighborsClassifier()),
          "Linear SVM": bagging_model(SVC(kernel="linear", random_state=seed)),
          "RBF SVM": bagging_model(SVC(kernel="rbf", random_state=seed))}
In [53]:
pca_bagging_scores = evaluate_models(models, X, y, "recall_macro")
Model:RF, Score:0.549 (+/- 0.026)
Model:LR, Score:0.558 (+/- 0.006)
Model:KNN, Score:0.539 (+/- 0.023)
Model:Linear SVM, Score:0.536 (+/- 0.009)
Model:RBF SVM, Score:0.571 (+/- 0.016)

We can see that Bagging using LogisitcRegression and RBF SVM exceeded the recall of BalancedRandomForestClassifier. However we have to note that we didn't increase the number of estimators, and we still don't know the exact recall of the responsive class.

What if we trained them on the original data?

But we'll need to scale the data because algorithms like Logistic Regression, KNN and SVM can perform better after scaling the data.

In [112]:
# Split training data into X and y
X = mailout_train_clean.loc[:, mailout_train_clean.columns != "RESPONSE"]
y = mailout_train_clean.loc[:, "RESPONSE"]
In [30]:
scaled_pipeline = lambda model: make_pipeline(StandardScaler(), model)

models = {"RF": BalancedRandomForestClassifier(random_state=seed),
          "LR": scaled_pipeline(bagging_model(LogisticRegression(max_iter=1000, random_state=seed))),
          "KNN": scaled_pipeline(bagging_model(KNeighborsClassifier())),
          "Linear SVM": scaled_pipeline(bagging_model(SVC(kernel="linear", random_state=seed))),
          "RBF SVM": scaled_pipeline(bagging_model(SVC(kernel="rbf", random_state=seed)))}
In [56]:
original_bagging_scores = evaluate_models(models, X, y, "recall_macro")
Model:RF, Score:0.673 (+/- 0.013)
Model:LR, Score:0.585 (+/- 0.016)
Model:KNN, Score:0.533 (+/- 0.028)
Model:Linear SVM, Score:0.576 (+/- 0.013)
Model:RBF SVM, Score:0.573 (+/- 0.006)

By comparing the results, we can see that all of the algorithms (except Bagged KNN) perform better using the original data. However the improvement in all of them compared to BalancedRandomForestClassifier is tiny, so we need not continue in exploring them.

In terms of the significant improvement in BalancedRandomForestClassifier, this could be for two reasons:

  1. The features that aren't included in the PCA could add important information to the model
  2. The PCA transformed data isn't particularly good for the task that we need

We can test this by adding all of the features that were left our from PCA transformation to the transformed data to test the performance of the algorithm again.

In [113]:
# Make new dataframe with only new feats
mailout_train_new_feats = mailout_train_clean.loc[:, new_feats]

# Concatenate PCA transformed dataframe with new feats dataframe
X = pd.concat([pd.DataFrame(pca.transform(mailout_train_pre_pca)[:, :d]), mailout_train_new_feats.reset_index(drop=True)], axis=1)

# Drop RESPONSE
X = X.drop(columns=["RESPONSE"])
In [58]:
model_validation(BalancedRandomForestClassifier(n_estimators=1000, random_state=seed), X, y)
Final Report:
              precision    recall  f1-score   support

         0.0       1.00      0.70      0.82     34658
         1.0       0.03      0.74      0.06       436

    accuracy                           0.70     35094
   macro avg       0.51      0.72      0.44     35094
weighted avg       0.98      0.70      0.81     35094

Metric Score: 0.7209044081958259

We can see that these features have definetly improved upon our PCA results, and even made it better than the original dataset score.

Right now what if we add the PCA features to all features?

In [32]:
# Concatenate PCA transformed dataframe with new feats dataframe
X = pd.concat([pd.DataFrame(pca.transform(mailout_train_pre_pca)[:, :d]), mailout_train_clean.reset_index(drop=True)], axis=1)

# Drop RESPONSE
X = X.drop(columns=["RESPONSE"])
In [61]:
model_validation(BalancedRandomForestClassifier(n_estimators=1000, random_state=seed), X, y)
Final Report:
              precision    recall  f1-score   support

         0.0       0.99      0.67      0.80     34658
         1.0       0.03      0.72      0.05       436

    accuracy                           0.67     35094
   macro avg       0.51      0.70      0.43     35094
weighted avg       0.98      0.67      0.79     35094

Metric Score: 0.6977696440858011

We can see that this has worsened the results. It seems that the PCA transformer features are able to provide info that is better than the original features before transformation, and the presence of both at the same time doesn't improve the model performance at all.

What if we add the clusters distances as features to PCA features + new features combo?

In [205]:
# transform mailout train data using PCA
mailout_train_pca = pd.DataFrame(pca.transform(mailout_train_pre_pca)[:, :d], 
                                 columns=[f"pca_{i}" for i in range(d)])

# predict mailout cluster distances
mailout_distances = pd.DataFrame(kmeans.transform(mailout_train_pca),
                                 columns=[f"cluster_{i}" for i in range(kmeans.cluster_centers_.shape[0])])

# predict mailout cluster
mailout_clusters = pd.Series(kmeans.predict(mailout_train_pca), name='label')

# visualize mailout clusters
mailout_clusters.value_counts().plot(kind="bar", title="How Are Clusters Distributed in MAILOUT Data?");

plt.tight_layout()
plt.savefig("mailout_clusters.png");

If we were to use these clusters directly, we would predict that the majority of the individuals in the mailout data would respond, but we know from the labels than only a very small portion of them actually responded.

In [47]:
# concatenate cluster labels to the clean dataset
X = pd.concat([mailout_train_pca, mailout_train_new_feats.reset_index(drop=True), mailout_distances], axis=1)

# drop RESPONSE
X = X.drop(columns=["RESPONSE"])
In [71]:
model_validation(BalancedRandomForestClassifier(n_estimators=1000, random_state=seed), X, y)
Final Report:
              precision    recall  f1-score   support

         0.0       0.99      0.69      0.81     34658
         1.0       0.03      0.72      0.05       436

    accuracy                           0.69     35094
   macro avg       0.51      0.70      0.43     35094
weighted avg       0.98      0.69      0.80     35094

Metric Score: 0.7035868915429383

The results are worse than just using PCA features and new features.

What is we just use cluster distances and new features?

I think that for this we could also test using other models, since the number of features it very small. So we could try other models using undersampling.

In [34]:
# concatenate cluster labels to the clean dataset
X = pd.concat([mailout_train_new_feats.reset_index(drop=True), mailout_distances], axis=1)

# drop RESPONSE
X = X.drop(columns=["RESPONSE"])
In [73]:
model_validation(BalancedRandomForestClassifier(n_estimators=1000, random_state=seed), X, y)
Final Report:
              precision    recall  f1-score   support

         0.0       1.00      0.68      0.81     34658
         1.0       0.03      0.82      0.06       436

    accuracy                           0.68     35094
   macro avg       0.51      0.75      0.43     35094
weighted avg       0.98      0.68      0.80     35094

Metric Score: 0.7474211491200037

We can see that the results have significantly improved, which indicates that the presence of all original features isn't useful for the model.

So we might want to test automatic feature selection to see if we can still use some of them and the PCA features to imporve the final results.

Automatic Feature Selection

There are several methods present in scikit-learn for feature selection, but I'll use SelectKBest which removes all but the highest scoring K features.

Let's just test SelectKBest on all of the features we had so far to see if it can get better results that our best trial so far (Recall 0.74)

In [115]:
# Concatenate all features
X = pd.concat([mailout_train_pca, mailout_train_clean.reset_index(drop=True), mailout_distances, mailout_clusters], axis=1)

# Drop RESPONSE
X = X.drop(columns=["RESPONSE"])
In [116]:
for k in [10, 30, 50, 100, 300, 500]:
    print("K:", k)
    
    # make pipeline
    pipeline = make_pipeline(SelectKBest(k=k),
                             BalancedRandomForestClassifier(n_estimators=1000, random_state=seed))
    
    model_validation(pipeline, X, y)
    print()
K: 10
Final Report:
              precision    recall  f1-score   support

         0.0       1.00      0.70      0.83     34657
         1.0       0.03      0.83      0.07       436

    accuracy                           0.71     35093
   macro avg       0.52      0.77      0.45     35093
weighted avg       0.99      0.71      0.82     35093

Metric Score: 0.7671317445321728

K: 30
Final Report:
              precision    recall  f1-score   support

         0.0       1.00      0.71      0.83     34657
         1.0       0.03      0.82      0.07       436

    accuracy                           0.71     35093
   macro avg       0.52      0.76      0.45     35093
weighted avg       0.98      0.71      0.82     35093

Metric Score: 0.7620556696904557

K: 50
Final Report:
              precision    recall  f1-score   support

         0.0       1.00      0.71      0.83     34657
         1.0       0.03      0.80      0.06       436

    accuracy                           0.71     35093
   macro avg       0.51      0.75      0.45     35093
weighted avg       0.98      0.71      0.82     35093

Metric Score: 0.7526066873716092

K: 100
Final Report:
              precision    recall  f1-score   support

         0.0       1.00      0.71      0.83     34657
         1.0       0.03      0.79      0.06       436

    accuracy                           0.71     35093
   macro avg       0.51      0.75      0.45     35093
weighted avg       0.98      0.71      0.82     35093

Metric Score: 0.7497644021549252

K: 300
Final Report:
              precision    recall  f1-score   support

         0.0       1.00      0.70      0.83     34657
         1.0       0.03      0.77      0.06       436

    accuracy                           0.71     35093
   macro avg       0.51      0.74      0.44     35093
weighted avg       0.98      0.71      0.82     35093

Metric Score: 0.7353716289811935

K: 500
Final Report:
              precision    recall  f1-score   support

         0.0       1.00      0.70      0.82     34657
         1.0       0.03      0.79      0.06       436

    accuracy                           0.70     35093
   macro avg       0.51      0.74      0.44     35093
weighted avg       0.98      0.70      0.81     35093

Metric Score: 0.744009369775735

We can see that selecting that using the top 10 features is enough to improve the Macro Recall to 0.77.

I'm still interested to see if we only use the feature selection on the PCA and old features only, while keeping the new feature and distances untouched.

In [117]:
pca_feats = list(mailout_train_pca.columns)
azdias_feats = list(clean_azdias.columns)

for k in [10, 30, 50]:
    print("K:", k)
   
    pipeline = Pipeline([
        ('feature_selection', ColumnTransformer([
            ('kbest', SelectKBest(k=k), pca_feats+azdias_feats),
        ], remainder='passthrough')),
        ('clf', BalancedRandomForestClassifier(n_estimators=1000, random_state=seed))
    ])
    
    model_validation(pipeline, X, y)
    print()
K: 10
Final Report:
              precision    recall  f1-score   support

         0.0       1.00      0.70      0.82     34657
         1.0       0.03      0.83      0.06       436

    accuracy                           0.70     35093
   macro avg       0.52      0.77      0.44     35093
weighted avg       0.99      0.70      0.81     35093

Metric Score: 0.7663125665425543

K: 30
Final Report:
              precision    recall  f1-score   support

         0.0       1.00      0.70      0.83     34657
         1.0       0.03      0.82      0.06       436

    accuracy                           0.71     35093
   macro avg       0.52      0.76      0.45     35093
weighted avg       0.98      0.71      0.82     35093

Metric Score: 0.7617978118285937

K: 50
Final Report:
              precision    recall  f1-score   support

         0.0       1.00      0.71      0.83     34657
         1.0       0.03      0.81      0.06       436

    accuracy                           0.71     35093
   macro avg       0.51      0.76      0.45     35093
weighted avg       0.98      0.71      0.82     35093

Metric Score: 0.7555872211446162

Now we know it's better to just pass all features for automatic feature selection, as the performance didn't improve over the best score.

Now I'll make a function to document all of the steps we did for preparing the dataset, so we'll be able to do the same in testing.

In [206]:
def prepare_mailout(df, p=0.5, test=False):
    """Prepare MAILOUT training and testing dataset for ML Pipeline."""
    # Set dropping threshold to 1.0 for test set
    if test:
        p = 1.0
        
    # Clean the dataset
    df_clean = clean_dataset(df, p_row=p, p_col=p, keep_features=["KBA13_ANTG4"])
    
    # Drop RESPONSE if train set
    if test:
        y = None
    else:
        y = df_clean["RESPONSE"]
        df_clean.drop("RESPONSE", axis=1, inplace=True)
        
    # Filter features used in train set only
    if test:
        train_feats = pickle.load(open("mailout_train_feats.pkl", "rb"))
        df_clean = df_clean.loc[:, train_feats]
    else:
        train_feats = list(df_clean.columns)
        pickle.dump(train_feats, open("mailout_train_feats.pkl", "wb"))
    
    # Missing values
    # Impute clean AZDIAS features using old imputer
    azdias_imputer = pickle.load(open("imputer.pkl", "rb"))
    df_clean.loc[:, clean_azdias.columns] = azdias_imputer.transform(df_clean[clean_azdias.columns])
    
    # Impute remaning features
    if test:
        # Load mailout imputer for test set
        mailout_imputer = pickle.load(open("mailout_imputer.pkl", "rb"))
        df_clean = pd.DataFrame(mailout_imputer.transform(df_clean), columns=df_clean.columns)
    else:
        # Fit imputer for train set and pickle to load with test set
        mailout_imputer = SimpleImputer(strategy="mean")
        df_clean = pd.DataFrame(mailout_imputer.fit_transform(df_clean), columns=df_clean.columns)
        pickle.dump(mailout_imputer, open("mailout_imputer.pkl", "wb"))
    
    # PCA features
    df_pre_pca = df_clean[clean_azdias.columns]
    df_pre_pca_scaled = scaler.transform(df_pre_pca)
    df_pca = pd.DataFrame(pca.transform(df_pre_pca_scaled)[:, :d], 
                          columns=[f"pca_{i}" for i in range(d)])

    # Cluster distances
    df_distances = pd.DataFrame(kmeans.transform(df_pca),
                                columns=[f"cluster_{i}" for i in range(kmeans.cluster_centers_.shape[0])])

    # Cluster labels
    df_clusters = pd.Series(kmeans.predict(df_pca), name='label')
    
    # Concatenate all features
    X = pd.concat([df_clean, df_pca, df_distances, df_clusters], axis=1)
    
    return X, y

Hyperparameter Tuning The Final Pipeline

In [207]:
# Prepare training set
X_train, y_train = prepare_mailout(mailout_train)
In [208]:
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('k_best',SelectKBest()),
    ('clf', BalancedRandomForestClassifier(random_state=seed))
])

pipeline_grid = {
    "k_best__k": [int(x) for x in np.linspace(start=5, stop=30, num=5)],
    "clf__n_estimators": [int(x) for x in np.linspace(start=1000, stop=3000, num=5)],
    "clf__max_depth": [int(x) for x in np.linspace(10, 110, num = 5)] + [None],
    "clf__min_samples_split": [2, 5, 10],
    "clf__min_samples_leaf": [1, 2, 4],
    "clf__bootstrap": [True, False],
}
In [209]:
combs = 1

for name, params in pipeline_grid.items():
    combs *= len(params)
    
print("Total number of combinations in parameters grid:", combs)
Total number of combinations in parameters grid: 2700

There 2700 different combinations to the hyperparamter grid that we have set. Therefore it is only logical to use randomized grid search, as we don't have the computational or time resources to find the best combination by brute force.

In [14]:
# # Instantiate StratifiedKFold object for CV
# skf = StratifiedKFold(n_splits=3)

# # Use RandomizedSearch to find the best hyperparameters combination
# pipeline_random = RandomizedSearchCV(estimator=pipeline, param_distributions=pipeline_grid,
#                                      scoring='recall_macro', n_iter=50, cv=skf, verbose=3,
#                                      random_state=seed, n_jobs=-1)

# # Fit the RandomizedSearch model to the training data
# pipeline_random.fit(X_train.values, y_train.values)
In [15]:
# pickle.dump(pipeline_random, open("pipeline_random.pkl", "wb"))
In [211]:
pipeline_random = pickle.load(open("pipeline_random.pkl", "rb"))

RandomizedSearchCV Results

In [212]:
best_params = pipeline_random.best_params_
print("Best parameters found:", best_params)
Best parameters found: {'k_best__k': 5, 'clf__n_estimators': 2500, 'clf__min_samples_split': 10, 'clf__min_samples_leaf': 2, 'clf__max_depth': 35, 'clf__bootstrap': True}

Checking metrics of final pipeline

In [213]:
pipeline.set_params(**best_params)

model_validation(pipeline, X_train, y_train, final_results=True, plot_confusion=True)
plt.tight_layout()
plt.savefig("confusion_matrix.png")
Final Report:
              precision    recall  f1-score   support

           0       1.00      0.70      0.82     34657
           1       0.03      0.84      0.07       436

    accuracy                           0.70     35093
   macro avg       0.52      0.77      0.44     35093
weighted avg       0.99      0.70      0.81     35093

Metric Score: 0.7696068445499852
In [214]:
pipeline_recall = 0.7696068445499852
baseline_recall = 0.7085381814866175
percent_increase = (pipeline_recall - baseline_recall) * 100 / baseline_recall
print("The final pipeline improved Macro Recall results by {:.2f}%".format(percent_increase))
The final pipeline improved Macro Recall results by 8.62%

The results aren't state of the art, but we can see that we are able to predict 84% of the responsive individuals, and only 30% of non-responsive ones are false positives.

Comparing the results to the baseline model, the Macro Recall has improved by 8.62%

Tuning The Decision Boundary

Now the final the thing that we can do is to tune the decision boundary to get the best results that we can get using this model. In order to do this, we need to predict the probability of classes first.

In [215]:
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=seed)

# Make empty y_pred to fill predictions of each fold
y_pred = np.zeros((y_train.shape[0], 2))

for i, (train_idx, test_idx) in enumerate(skf.split(X_train, y_train)):
    X_tr, X_test = X_train.iloc[train_idx], X_train.iloc[test_idx]
    y_tr, y_test = y_train.iloc[train_idx], y_train.iloc[test_idx]

    # Make copy of model
    model_clone = clone(pipeline)

    # Fit model
    model_clone.fit(X_tr, y_tr)

    # Predict fold y_test and add in y_pred
    fold_pred = model_clone.predict_proba(X_test)
    y_pred[test_idx, :] += fold_pred

Now we can plot the scores over all possible thresholds

In [216]:
threshs = np.linspace(0.4, 0.6, 20)
f1_scores = []
recall_scores = []
precision_scores = []

for thresh in threshs:
    thresh_pred = (y_pred[:, 1] > thresh).astype(int)
    f1_scores.append(f1_score(y_train, thresh_pred, average="macro"))
    recall_scores.append(recall_score(y_train, thresh_pred, average="macro"))
    precision_scores.append(precision_score(y_train, thresh_pred, average="macro", zero_division=1))
    
best_recall_thresh = threshs[np.argmax(recall_scores)]
print("Threshold optimizing Recall: {:.3f}".format(best_recall_thresh))
Threshold optimizing Recall: 0.516
In [217]:
plt.figure(figsize=(8, 4))
# plt.plot(threshs, f1_scores, label="F1");
plt.plot(threshs, recall_scores, label="Recall");
plt.plot(threshs, precision_scores, label="Precision");
plt.axvline(best_recall_thresh, linestyle='--', color='r')
plt.legend();
plt.title("Precision & Recall at Different Prediction Thresholds");
plt.tight_layout()
plt.savefig("thresholds.png");
In [218]:
best_thresh_pred = (y_pred[:, 1] > best_recall_thresh).astype(int)

ax = plt.subplot()
cmat = confusion_matrix(y_train, best_thresh_pred)
sns.heatmap(cmat, annot=True, fmt="g", ax=ax)

ax.set_xlabel('Predicted labels');ax.set_ylabel('True labels'); 
ax.set_title('Confusion Matrix'); 
ax.xaxis.set_ticklabels(['No Response', 'Response']); ax.yaxis.set_ticklabels(['No Response', 'Response']);

plt.tight_layout()
plt.savefig("confusion_matrix1.png")

We can see that the best threshold slightly improved the precision of the model, by decreasing the number of false positives from 10327 to 10283.

Final Remarks

In each step we have done in the pipeline so far, there must have been other ways we could have achieved the same or even better results. But speaking from a business perspective, if this dataset provided the data one previous campaign were the conversion rate was minute as seen, using this model which isn't particularly state of the art, would definitely decrease the costs and increase the conversion rate of the campaign due to the increased selectivity.

Training Final Model

In [41]:
pipeline.set_params(**best_params)
pipeline.fit(X_train, y_train)
Out[41]:
Pipeline(steps=[('k_best', SelectKBest(k=5)),
                ('clf',
                 BalancedRandomForestClassifier(max_depth=35,
                                                min_samples_leaf=2,
                                                min_samples_split=10,
                                                n_estimators=2500,
                                                random_state=42))])

Saving Model

In [42]:
pickle.dump(pipeline, open("final_pipeline.pkl", "wb"))

Part 3: Kaggle Competition

Now that you've created a model to predict which individuals are most likely to respond to a mailout campaign, it's time to test that model in competition through Kaggle. If you click on the link here, you'll be taken to the competition page where, if you have a Kaggle account, you can enter. If you're one of the top performers, you may have the chance to be contacted by a hiring manager from Arvato or Bertelsmann for an interview!

Your entry to the competition should be a CSV file with two columns. The first column should be a copy of "LNR", which acts as an ID number for each individual in the "TEST" partition. The second column, "RESPONSE", should be some measure of how likely each individual became a customer – this might not be a straightforward probability. As you should have found in Part 2, there is a large output class imbalance, where most individuals did not respond to the mailout. Thus, predicting individual classes and using accuracy does not seem to be an appropriate performance evaluation method. Instead, the competition will be using AUC to evaluate performance. The exact values of the "RESPONSE" column do not matter as much: only that the higher values try to capture as many of the actual customers as possible, early in the ROC curve sweep.

In [43]:
mailout_test = pd.read_csv('../../data/Term2/capstone/arvato_data/Udacity_MAILOUT_052018_TEST.csv', sep=';')
/opt/conda/lib/python3.6/site-packages/IPython/core/interactiveshell.py:2785: DtypeWarning: Columns (18,19) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
In [44]:
mailout_test.head()
Out[44]:
LNR AGER_TYP AKT_DAT_KL ALTER_HH ALTER_KIND1 ALTER_KIND2 ALTER_KIND3 ALTER_KIND4 ALTERSKATEGORIE_FEIN ANZ_HAUSHALTE_AKTIV ANZ_HH_TITEL ANZ_KINDER ANZ_PERSONEN ANZ_STATISTISCHE_HAUSHALTE ANZ_TITEL ARBEIT BALLRAUM CAMEO_DEU_2015 CAMEO_DEUG_2015 CAMEO_INTL_2015 CJT_GESAMTTYP CJT_KATALOGNUTZER CJT_TYP_1 CJT_TYP_2 CJT_TYP_3 CJT_TYP_4 CJT_TYP_5 CJT_TYP_6 D19_BANKEN_ANZ_12 D19_BANKEN_ANZ_24 D19_BANKEN_DATUM D19_BANKEN_DIREKT D19_BANKEN_GROSS D19_BANKEN_LOKAL D19_BANKEN_OFFLINE_DATUM D19_BANKEN_ONLINE_DATUM D19_BANKEN_ONLINE_QUOTE_12 D19_BANKEN_REST D19_BEKLEIDUNG_GEH D19_BEKLEIDUNG_REST D19_BILDUNG D19_BIO_OEKO D19_BUCH_CD D19_DIGIT_SERV D19_DROGERIEARTIKEL D19_ENERGIE D19_FREIZEIT D19_GARTEN D19_GESAMT_ANZ_12 D19_GESAMT_ANZ_24 D19_GESAMT_DATUM D19_GESAMT_OFFLINE_DATUM D19_GESAMT_ONLINE_DATUM D19_GESAMT_ONLINE_QUOTE_12 D19_HANDWERK D19_HAUS_DEKO D19_KINDERARTIKEL D19_KONSUMTYP D19_KONSUMTYP_MAX D19_KOSMETIK ... LP_FAMILIE_GROB LP_LEBENSPHASE_FEIN LP_LEBENSPHASE_GROB LP_STATUS_FEIN LP_STATUS_GROB MIN_GEBAEUDEJAHR MOBI_RASTER MOBI_REGIO NATIONALITAET_KZ ONLINE_AFFINITAET ORTSGR_KLS9 OST_WEST_KZ PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_BAUMAX PLZ8_GBZ PLZ8_HHZ PRAEGENDE_JUGENDJAHRE REGIOTYP RELAT_AB RETOURTYP_BK_S RT_KEIN_ANREIZ RT_SCHNAEPPCHEN RT_UEBERGROESSE SEMIO_DOM SEMIO_ERL SEMIO_FAM SEMIO_KAEM SEMIO_KRIT SEMIO_KULT SEMIO_LUST SEMIO_MAT SEMIO_PFLICHT SEMIO_RAT SEMIO_REL SEMIO_SOZ SEMIO_TRADV SEMIO_VERT SHOPPER_TYP SOHO_KZ STRUKTURTYP TITEL_KZ UMFELD_ALT UMFELD_JUNG UNGLEICHENN_FLAG VERDICHTUNGSRAUM VERS_TYP VHA VHN VK_DHT4A VK_DISTANZ VK_ZG11 W_KEIT_KIND_HH WOHNDAUER_2008 WOHNLAGE ZABEOTYP ANREDE_KZ ALTERSKATEGORIE_GROB
0 1754 2 1.0 7.0 NaN NaN NaN NaN 6.0 2.0 0.0 0.0 2.0 2.0 0.0 3.0 6.0 2B 2 13 5.0 5.0 1.0 2.0 5.0 5.0 5.0 5.0 0 0 10 0 0 0 10 10 0.0 0 3 6 0 0 0 0 6 0 0 0 1 2 4 8 4 10.0 0 6 0 3.0 2 6 ... 2.0 20.0 5.0 10.0 5.0 1993.0 5.0 4.0 1 4.0 4.0 W 2.0 3.0 2.0 1.0 1.0 3.0 3.0 2 3.0 4.0 5.0 2.0 5.0 1.0 3 3 4 2 3 6 7 1 1 1 2 6 1 7 3 0.0 3.0 0.0 2.0 5.0 0.0 23.0 1 1.0 4.0 5.0 6.0 3.0 6.0 9.0 3.0 3 1 4
1 1770 -1 1.0 0.0 NaN NaN NaN NaN 0.0 20.0 0.0 0.0 1.0 21.0 0.0 4.0 7.0 5A 5 31 1.0 5.0 2.0 2.0 5.0 5.0 4.0 5.0 0 0 10 0 0 0 10 10 0.0 0 0 6 0 0 7 0 0 0 0 0 0 0 9 9 10 0.0 0 6 0 3.0 2 6 ... 1.0 6.0 2.0 1.0 1.0 1992.0 1.0 2.0 1 1.0 5.0 W 2.0 3.0 2.0 1.0 1.0 3.0 5.0 5 4.0 5.0 5.0 1.0 5.0 1.0 3 7 4 3 3 4 5 4 3 3 3 6 1 6 2 0.0 3.0 0.0 4.0 3.0 0.0 0.0 1 1.0 1.0 5.0 2.0 1.0 6.0 9.0 5.0 3 1 4
2 1465 2 9.0 16.0 NaN NaN NaN NaN 11.0 2.0 0.0 0.0 4.0 2.0 0.0 4.0 1.0 7A 7 41 2.0 5.0 2.0 2.0 5.0 5.0 5.0 5.0 0 0 10 0 0 0 10 10 0.0 0 0 0 0 0 6 0 0 0 0 0 0 0 10 10 10 0.0 7 0 0 9.0 8 0 ... 5.0 40.0 12.0 10.0 5.0 1992.0 2.0 3.0 1 0.0 6.0 W 3.0 2.0 1.0 0.0 1.0 5.0 4.0 6 6.0 5.0 5.0 2.0 5.0 1.0 5 7 1 6 4 2 7 1 4 3 1 4 3 3 3 0.0 3.0 0.0 1.0 5.0 1.0 15.0 1 1.0 3.0 9.0 6.0 3.0 2.0 9.0 4.0 3 2 4
3 1470 -1 7.0 0.0 NaN NaN NaN NaN 0.0 1.0 0.0 0.0 0.0 1.0 0.0 4.0 1.0 2B 2 13 4.0 5.0 2.0 1.0 5.0 5.0 5.0 5.0 0 0 10 0 0 0 10 10 0.0 0 0 0 0 0 0 0 0 0 0 6 0 0 8 8 8 0.0 0 0 0 9.0 8 6 ... 0.0 0.0 0.0 3.0 2.0 1992.0 5.0 5.0 1 3.0 8.0 O 4.0 1.0 0.0 0.0 1.0 3.0 3.0 5 5.0 4.0 5.0 1.0 3.0 1.0 5 7 1 6 7 2 7 2 2 3 2 3 3 2 3 0.0 3.0 0.0 2.0 5.0 0.0 10.0 2 1.0 2.0 6.0 6.0 3.0 NaN 9.0 2.0 3 2 4
4 1478 1 1.0 21.0 NaN NaN NaN NaN 13.0 1.0 0.0 0.0 4.0 1.0 0.0 3.0 6.0 5A 5 31 6.0 1.0 3.0 2.0 4.0 4.0 4.0 3.0 3 4 2 2 2 0 8 2 10.0 6 6 6 5 0 3 0 2 0 5 5 4 6 2 3 2 7.0 7 2 5 1.0 1 7 ... 5.0 37.0 12.0 9.0 4.0 1992.0 5.0 5.0 1 5.0 3.0 W 4.0 2.0 0.0 0.0 1.0 2.0 1.0 8 4.0 2.0 3.0 3.0 2.0 3.0 6 7 2 5 4 2 5 2 3 3 2 5 3 3 3 0.0 2.0 0.0 5.0 5.0 0.0 0.0 1 1.0 1.0 2.0 4.0 3.0 3.0 9.0 7.0 4 2 4

5 rows × 366 columns

In [45]:
mailout_test.shape
Out[45]:
(42833, 366)
In [46]:
# Copying LNR column for Kaggle submission csv file
test_LNR = mailout_test['LNR']

Data Preparation

In [47]:
X_test, _ = prepare_mailout(mailout_test, test=True)
In [48]:
# Check that all columns are the same between training and test set
assert set(X_train.columns) == set(X_test.columns)

Prediction

In [49]:
test_preds = pipeline.predict_proba(X_test)

Kaggle Submission

In [50]:
submission = pd.DataFrame({'LNR':test_LNR.astype(np.int32), 'RESPONSE':test_preds[:, 1]})
submission.head()
Out[50]:
LNR RESPONSE
0 1754 0.817733
1 1770 0.557044
2 1465 0.264191
3 1470 0.256978
4 1478 0.256926
In [51]:
submission.to_csv('kaggle.csv', index=False)

Conclusion

To summarize what we have done in the project so far:

  1. We explored the general population dataset to understand how it should be cleaned for our analysis
  2. We made a pipeline for cleaning the general population dataset and any dataset that has a similar structure
  3. We performed dimensionality reduction on the general population dataset followed by a clustering analysis of the population
  4. We cleaned the customers' dataset using the pipeline we previously made and analyzed the clusters' representation of the busniess's customer base
  5. We analyzed the characterstics of our customer base and how they differ from non-customers
  6. We analyzed the differences between differen clusters in the customer base
  7. We explored the mailout dataset and made a pipeline to prepare the dataset for the supervised learning task
  8. We analyzed different algorithms and metrics then selected the best ones that suited the situation of the dataset we have which was Balanced Random Forests for the algorithm and Macro Recall for metric which deals with the target class imbalance
  9. We tested using feature selection in the pipeline to improve the results and found that it had indeed improved it
  10. We made a final pipeline and tuned it's hyperparameters to predictof individuals with high probability responding the mail-out campaign

Cleaining the general population dataset was really challenging for me, as it was the first time I've ever dealt with dataset this size, and I didn't know where to begin in exploring it.

This has forced me to find some methods to be able to digest the data in smaller portions to get a general idea about how it should be dealt with, like to into categories separately.

Also, I really enjoyed the customer segmentation part, even though it could be improved since we only tried reducing dimensionality using PCA and clustered using K-Means, where we could have used different algorithms for clustering such as DBSCAN, Agglomerative Clustering or Gaussian Mixture Models.